Graphical abstract

   graphic file with name fx1.jpg
   [45]Open in a new tab

Highlights

     * •
       enONE, a computational framework of NAD-RNA, reduces technical
       noise using spike-in
     * •
       Human peripheral blood cells contain NAD-RNA and tend to increase
       with age
     * •
       RNA-based aging clock, integrating gene expression with NAD-RNA,
       can predict age
     * •
       NAD-capped RNA from circulating blood can be developed as potential
       biomarkers
     __________________________________________________________________

   Computational bioinformatics; Sequence analysis; Transcriptomics;
   Methodology in biological sciences

Introduction

   Nicotinamide adenine dinucleotide (NAD), an adenine nucleotide
   containing metabolite, can be incorporated into the RNA 5′-terminus to
   result in NAD-capped RNA (NAD-RNA),[46]^1^,[47]^2 which is different
   from eukaryotic canonical cap structure formed by 7-methylguanosine
   (m^7G) via a 5′-to-5′-triphosphate bridge (m^7G-RNA).[48]^3^,[49]^4 It
   has been estimated that NAD-capped forms constitute approximately 0.6%
   and 1.3% of the transcripts expressed in the mouse liver and kidney,
   respectively.[50]^5 To capture such low-level capping events, the
   recently developed NAD-RNA identification methods involve multi-steps
   of chemo-enzymatic reaction, followed by affinity-based
   enrichment.[51]^6^,[52]^7^,[53]^8 As a consequence, the resulting
   high-throughput sequencing data can be hampered by the effect of
   capture procedures and other unwanted variations, e.g., the batch
   effect. Given these limitations, current computational methods cannot
   be directly applied to the omic-level quantitative assessment of
   NAD-capped RNAs.

   Proper normalization may serve to remove unwanted variations from
   enrichment-based sequencing methods. N^6-methyladenosine (m^6A), a
   known epitranscriptomic modification in RNA, has been extensively
   characterized in virus and eukaryotic organisms by antibody-based
   immunoprecipitation and sequencing.[54]^9 Computational tools for
   m^6A-seq, e.g., RADAR[55]^10 and m^6A-express,[56]^11 employ a split
   scaling strategy that calculates scale factors for input and
   enrichment, respectively, to adjust the variations from enrichment
   procedures and sequencing depth. However, these analytical methods
   cannot properly account for the unwanted variation between samples with
   and without enrichment, thus challenging the identification of
   enrichment signals. More generally, current analyses of
   epitranscriptomic data are mostly based on normalization designed for
   bulk RNA-seq, e.g., scaling-based methods, such as total count (TC),
   trimmed mean of M values (TMM),[57]^12 and DESeq.[58]^13 The implicit
   assumption underlying scaling-based methods is that all the gene-level
   counts are proportional to scale factors and that the between-sample
   variations can be adequately adjusted by scale factors. Unfortunately,
   this assumption is inevitably violated when affinity-based enrichment
   selectively amplifies the signal of genes, e.g., m^6A and NAD capping,
   which leads to disproportional gene counts between input and
   enrichment. Another regression-based method, remove unwanted variation
   (RUV),[59]^14 regresses gene count measurements on unwanted factors,
   thus computing corrected expression values from the residuals. The
   implicit assumption underlying this method is that a set of negative
   controls, which are not affected by covariates of interest, is
   available, such as the spike-in from the External RNA Controls
   Consortium (ERCC). However, the ERCC-based method suffers from
   discrepancies between endogenous transcripts and spike-in, hindering
   its usage in omic-level profiling. Limited by current analytical
   methods, nuisance variations fail to be properly corrected for
   enrichment-based epitranscriptomic sequencing data, thereby obscuring
   true biological signals.

   NAD, an essential metabolite and redox regent, has been found to
   decline with age in crucial tissues of mammals including human.[60]^15
   Given the dynamics of NAD and gene expression over the course of adult
   lifespan, NAD-capped RNA, poised to integrate metabolomics and
   transcriptomics, may provide novel insights into physiological and
   perhaps pathological situations. Thus, it is tempting to explore how
   NAD-modified epitranscriptome is modulated with age. However, analysis
   of NAD-capped RNAs from heterogeneous populations, particularly those
   from aging human cohort, presents an additional challenge due to the
   interrelated variations from individual heterogeneity, impacts of
   aging, and the technical noises inherent in affinity-based enrichment
   procedures. Therefore, such complex datasets require a new analytical
   strategy for disentangling the covariates of interest.

   In the present study, we propose enONE computational framework for
   NAD-capped RNA data analysis by efficiently removing the impact of
   affinity-based enrichment variations using spike-ins. By identifying
   genes that capture unwanted variations from exogenous spike-in,
   referred to as the “anchor set”, enONE can transform data into a shared
   space, in the presence of extensive technical variations, e.g., batch
   effect. Based on the spike-in anchor set, enONE integrates global
   scaling and regression-based normalizations, followed by performance
   evaluation that selects the local-optimal normalization method to
   remove unwanted variation. As a computational approach, enONE, combined
   with NAD-RNA capture and sequencing technology, particularly our
   recently developed ONE-seq method,[61]^8 can facilitate quantitative
   analysis of NAD-RNA profiles. Using human aging cohort, we apply enONE
   to the identification of NAD-RNAs from human peripheral blood
   mononuclear cells (PBMCs), revealing critical features and dynamics of
   NAD modification with age. We further combined transcriptomic and
   epitranscriptomic features, to develop an RNA-based clock for the
   prediction of human chronological age.

Results

The workflow of enONE

   We designed a computational framework that implemented and evaluated
   the performance of various spike-in-based normalization methods for the
   analysis of NAD-RNA sequencing data. We thereby named our analytical
   method enONE, for Epitranscriptional NAD-capped RNA analysis by
   spike-in-based Omic-level Normalization and Evaluation ([62]Figure 1).

Figure 1.

   [63]Figure 1
   [64]Open in a new tab

   The workflow of enONE

   enONE starts with quality control to obtain high-quality sequence reads
   by removing outliers and lowly expressed genes. enONE performs
   spike-in-based normalization that integrates global scaling and
   regression-based procedures to generate normalization toolsets. Next,
   enONE uses eight data-driven metrics to evaluate normalization
   performance. By exploiting the full space of normalizations, enONE
   identifies the optimal procedure that maximally removes unwanted
   variations, while minimally impacting the signals from NAD-RNA-seq
   data.

   enONE initiates with a quality control step, to remove outlier samples
   and lowly expressed transcripts. Second, a subset of genes from total
   RNA spike-in, with their expressions presumably not being influenced by
   the covariates of interest (e.g., enrichment assay or biological
   condition), are used as the anchor set to estimate unwanted variation
   (e.g., batch effect). A generalized linear model is then applied to
   regress the observed read counts from anchor set on the unknown
   nuisance variables to estimate factors that are subsequently used by
   the normalization tools for the adjustment of unwanted variation.
   Third, a two-part normalization template is employed to define an
   ensemble of the normalization procedures: (1) global scaling of read
   counts to account for between-sample difference in sequencing depth and
   other parameters of the read count distribution and (2)
   regression-based adjustment for unwanted variations. For instance, one
   can apply a robust scaling procedure, such as TMM, followed by
   unsupervised procedures to estimate hidden unwanted variations and
   regress them out of the data (e.g., RUV[65]^14). Fourth, enONE compares
   different normalization tools to identify sets of top-performing
   procedures. Specifically, enONE calculates ranks based on eight
   performance metrics that represent the local-optimal trade-offs toward
   removing unwanted variation, preserving biological variation of
   interest, and maintaining minimal technical variability of global
   expression. Combined, enONE utilizes a data-driven approach to
   determine appropriate normalization procedures for the quantitative
   analysis of NAD-modified epitranscriptome.

Epitranscriptomic profiling of human PBMCs

   To gain insights of NAD-modified epitranscriptome during aging, we
   collected human PBMCs from an aging cohort of community subjects
   comprising of young (N = 23, age: 23–32), middle (N = 20, age: 40–50),
   and old (N = 18, age: 54–67) individuals for epitranscriptome-wide
   profiling of NAD-RNAs ([66]Figure 2A), according to the inclusion
   criteria approved by the Ethics Committee. Clinical characteristics of
   the participants were evaluated and listed in [67]Table S1.

Figure 2.

   [68]Figure 2
   [69]Open in a new tab

   The feasibility of enONE

   (A) Aging Cohort and experiment outline. A total of 61 participants
   (female/male = 31:30, age 23–67) were enrolled for NAD-modified
   epitranscriptome profiling. Total RNAs from PBMCs were mixed with
   Drosophila spike-in RNA, and two synthetic spike-in, one with 5%
   NAD-capped forms and another with 100% m^7G-capped forms. The mixture
   was subjected to NAD-RNA-seq, followed by enONE computational analysis
   of NAD-RNA profiles (highlighted in red).

   (B) Scatterplots of first two PCs for the 1,000 least significantly
   enriched genes (denoted as anchor set) from Drosophila spike-in.

   (C) Boxplot showing the normalization performance based on anchor set
   of different sizes from Drosophila spike-ins.

   (D) Fraction of the first two expression PCs variance explained values
   (taken cumulatively) for linear regression model using enrichment
   variable as the explanatory variable. PCs were computed from the
   variance stabilizing and transformed matrix of spike-in counts.

   (E) Scatterplots of first two PCs for Drosophila spike-in counts from
   different procedures.

   (F) The R^2 of linear regression between batch effect and up to the
   first six PCs (taken cumulatively).

   (G) Spearman’s correlation coefficients between the raw/normalized
   counts and batch effect.

   (H) The number of genes varied among bathes as inferred by ANOVA
   (p < 0.01).

   (I) Boxplot showing the relative expression levels of synthetic
   spike-ins with 5% NAD-caps and that with 100% m^7G-caps from different
   methods.

   Since NAD capping is an evolutionary conserved RNA modification,[70]^16
   we reasoned that exogenous total RNA from an invertebrate species can
   be used as common standards to mitigate unwanted variations from human
   samples. We deliberately included three types of spike-in RNAs: (1)
   total RNA from Drosophila melanogaster, an invertebrate model organism
   with well-annotated genome sequence; (2) a synthetic RNA consisting of
   5% NAD relative to m^7G-capped forms, used to determine the capture
   sensitivity; and 3) a synthetic RNA with 100% m^7G-capped form, used to
   determine the capture specificity ([71]Figure 2A). Notably, spike-ins 2
   and 3 were synthesized with templates from different sequences.
   Combined, we subjected total RNAs from human PBMCs, Drosophila,
   spike-in 2, and spike-in 3 to ONE-seq-based capture and sequencing,
   followed by enONE computational analysis of NAD-RNA profiles.

   After quality control, we obtained an average about 49.2 million
   high-quality and uniquely mapped sequencing reads from each library
   ([72]Figure S1A). Assessment of datasets corroborated that sequencing
   saturation has been reached ([73]Figure S1B). Spike-in 2, which
   contained 5% NAD-capped form, was significantly enriched, whereas no
   enrichment was found for spike-in 3 made up with 100% m^7G-RNA
   ([74]Figure S1C). Above evidence highlighted the sensitivity and
   specificity of the enrichment experiment, as reflected by the
   enrichment of NAD-capped, but not m^7G-capped, transcripts.

The feasibility of enONE

   Since all samples were added with equal amounts of Drosophila total RNA
   as spike-in, its disconcordance, if present, can be used to pinpoint
   the nuisance technical variation in an epitranscriptome-wide manner,
   and its concordance, on the other hand, can be used to validate the
   effect of normalization. To capture unwanted variation, i.e., batch
   effect, we used a set of genes (n = 1,000) whose expression patterns
   should be highly reproducible and now become differed among batches as
   the anchor set ([75]Figure 2B). In addition, we showed that
   normalization procedures were robust when the enrichment effect
   accounted for a small fraction of the anchor set variance, e.g., anchor
   set size ranged from 500 to 2,500 ([76]Figures 2C and 2D). With anchor
   set from Drosophila spike-in RNA, we implemented enONE normalization
   procedures with five scaling toolsets, including TC, UQ, TMM, DESeq,
   and PoissonSeq,[77]^17 as well as three regression-based procedures,
   namely RUVg, RUVs, and RUVse. By integrating two normalization modules,
   we generated a total of 96 combinatorial procedures for the current
   data ([78]Figure S2A). By inspecting the full space of normalization
   performance metrics, we found that the top-ranked procedure involved
   DESeq scaling followed by RUVg adjustment for the first 4 factors of
   unwanted variation ([79]Figures S2B and S2C). Additionally, this
   integrated procedure showed that the combination of global scaling and
   regression-based procedures could improve the normalization
   performance.

   To compare the effect of normalization, we applied three different
   methods—RUVg (k = 4), a regression-based method; RADAR, a scaling-based
   method; and enONE—on all three types of spike-in RNAs. Compared to
   other procedures, enONE normalization dramatically mitigated the batch
   effect of Drosophila spike-in in both input and enrichment libraries,
   while preserving the enrichment signals ([80]Figure 2E). In Drosophila
   spike-in, linear regression analysis between the first six PCs
   cumulatively and batch effect showed that batch effect explained over
   83% of the variations in raw count data. This variation was reduced to
   approximately 24% and 52% with the RUVg and RADAR methods,
   respectively. Significantly, enONE normalization further mitigated the
   variance explained by batch effect to approximately 17%
   ([81]Figure 2F). Analysis of correlation between individual gene
   expression values from Drosophila spike-in and batch variation revealed
   a large proportion of genes showing strong correlations with batch
   effect in raw, RUVg, and RADAR datasets, whereas this correlation was
   dramatically diminished by enONE ([82]Figure 2G). ANOVA was performed
   on Drosophila spike-in datasets from different batches to evaluate the
   impact of batch effect on gene expression measurements. Ideally, there
   should be minimal difference in gene expression across batches;
   however, over 5,800 genes were found to be differentially expressed in
   the raw dataset, reflecting strong batch effect. enONE was able to
   abolish the batch effect by reducing 96% of such differentially
   expressed genes, whereas RUVg and RADAR could reduce 90% and 62% of
   these genes, respectively ([83]Figure 2H). Since synthetic spike-ins
   were not used to estimate unwanted variation factors, their concordance
   could serve as an independent validation of normalization performance.
   The boxplots of raw relative log-expression showed large differences
   among synthetic spike-ins with equal input amounts ([84]Figure 2I).
   Compared to RUVg and RADAR, enONE clearly reduced the overall
   variations of synthetic spike-ins as illustrated by the lower
   interquartile range. Additionally, enONE normalization preserved the
   distribution differences between input and enrichment samples of
   NAD-capped, but not m^7G-capped, spike-ins, indicating that enONE could
   retain enrichment signals from transcripts with NAD-cap, but not
   m^7G-cap ([85]Figure 2I).

   We next compared the performance of enONE with other normalization
   approaches using human PBMCs dataset. Although the RUVg and RADAR
   reduced the variations, both methods had limitations; specifically,
   RUVg failed to preserve the enrichment signal and RADAR overcorrected
   the variation among enrichment samples ([86]Figure 3A). Principal
   component analysis plots illustrated that enONE improved upon the RUVg
   and RADAR normalizations in removing the variation of batch effect and
   preserving the enrichment signals ([87]Figure 3A). To examine batch
   effects in raw and normalized data, we performed linear regression
   analysis and ANOVA, which showed that enONE exhibited improved
   performance than RUVg and RADAR ([88]Figure 3B). Furthermore, we
   evaluated the preservation of covariate of interests, i.e., enrichment
   effect. By performing vector correlation analysis between the first six
   cumulative PCs and the enrichment covariate, we noted that over 97% of
   the enrichment variations were well preserved by enONE ([89]Figure 3B).
   Besides, the silhouette coefficient and adjusted Rand index analysis
   was used to assess the separation of input and enrichment samples.
   Consistently, this result showed that enONE outperformed other
   procedures in retaining the enrichment signals ([90]Figure 3C).
   Together, these results demonstrate the capacity of enONE in removing
   unwanted variation while retaining the covariates of interest.

Figure 3.

   [91]Figure 3
   [92]Open in a new tab

   Performance assessment of different normalizations on human PBMCs
   dataset

   (A) Top row: scatterplots of first two PCs for human PBMCs counts from
   different procedures colored by batch. Bottom row: same as the top row
   colored by the enrichment assay.

   (B) Left: the R^2 of linear regression between batch effect and up to
   the first six PCs (taken cumulatively). Right: the number of genes
   varied among bathes as inferred by ANOVA (p < 0.01).

   (C) Left: line chart showing the vector correlation coefficient between
   enrichment effect and the first six PCs (taken cumulatively). Right:
   scatterplot showing silhouette coefficient and adjusted Rand index
   (ARI) for measuring the separation of enrichment groups.

Characterization of NAD-RNAs from human PBMCs

   We proceeded to set 2-fold enrichment of read counts as the cutoff,
   which led us to identify a total of 809 NAD-RNAs from human PBMCs
   ([93]Figure 4A and [94]Tables S2, [95]S3, and [96]S4). We then
   characterized these newly identified NAD-RNAs. In human PBMCs, NAD
   capping mostly occurred on protein-encoding genes, but also extended to
   pseudogenes and non-coding RNAs, including lincRNA, snRNA, snoRNA, and
   miscRNA ([97]Figure 4A). NAD-RNAs were shown to be derived from genes
   localized on autosomes, X chromosomes, and the mitochondrion genome,
   but not from the Y chromosome ([98]Figure 4B). By dividing NAD-RNAs
   into 5 deciles based on enrichment, we observed that genes with short
   length and fewer introns tended to have increased modification of NAD
   ([99]Figure 4C), a pattern consistent with our recent study in mouse
   livers.[100]^8 Since previous studies have shown that RNA capping
   events can be influenced by the proximal RNA secondary
   structure,[101]^18 we examined the sequence feature using the 100 bp
   region downstream of the transcription start site. By leveraging the
   minimum free energy (MFE),[102]^19 we observed that transcripts with
   NAD-cap tended to adopt predicted structures with higher MFE level than
   those with m^7G-cap ([103]Figure 4D). Furthermore, transcripts with
   increased MFE level positively correlated with the extent of NAD
   modification ([104]Figure 4E), suggesting that RNA transcripts with a
   stable cap-proximal structure might promote NAD capping. To inspect NAD
   modification of genes associated with biological functions, we
   performed pathway enrichment analysis, which revealed that NAD-RNAs
   from human PBMCs were mainly involved in RNA metabolism, translation,
   transcription, energy metabolism, and immune system ([105]Figure 4F and
   [106]Table S5).

Figure 4.

   [107]Figure 4
   [108]Open in a new tab

   enONE facilitates NAD-capped RNAs identification

   (A) Barplot showing the gene types of identified NAD-RNAs (n = 809)
   from human PBMCs.

   (B) Circular bar plot showing the chromosomes distribution of
   identified NAD-RNAs.

   (C) From five deciles based on enrichment, genes with short length and
   with fewer introns tended to have increased modification of NAD. p
   values were obtained with ANOVA.

   (D) Left: for transcripts with 5′ untranslated region (UTR), the 100 bp
   region downstream of transcription start site (TSS) is used to compute
   the MFE. Using bootstrapping, the background distribution of MFE was
   generated based on the transcripts that did not produce NAD-cap. Right:
   transcripts that produced NAD-cap have predicted structures with higher
   MFE levels than those did not. p value was estimated with bootstrapping
   (n = 1,000).

   (E) From five deciles based on enrichment, genes with higher MFE levels
   tend to have increased NAD capping. p value was obtained with ANOVA.

   (F) Pathway analysis reveals the biological processes of NAD-capped
   RNAs that are mainly involved in RNA metabolism, transcription,
   translation, and immune system. Gray dashed line in the bar plot
   indicates the 0.05 FDR cutoff.

Age alters NAD-modified epitranscriptome

   To assess the heterogeneity of individual subjects, we utilized the
   Jensen-Shannon distance (JSD) metric to evaluate the variation in
   transcriptomic or epitranscriptomic distributions relative to their
   reference distribution, i.e., centroid.[109]^20^,[110]^21 A JSD value
   of 0 indicates an equal distribution between the reference and sample,
   while a JSD of 1 represents the greatest difference between the
   reference and sample. The transcriptome-derived JSD metrics had an
   average value of 0.10 and exhibited no significant correlation with
   ages (Pearson’s R = −0.036, p = 0.78). On the other hand, the JSD
   derived from epitranscriptome profiles had an average value of 0.12 and
   demonstrated a significantly negative correlation with age (Pearson’s
   R = −0.29, p = 0.025). These findings suggested a greater
   inter-individual heterogeneity in the distribution of epitranscriptome
   profiles, compared to transcriptome, and this heterogeneity tended to
   decrease with age ([111]Figure 5A). To further explore how NAD-RNAs
   were modulated with age and its consequent impact on the progression of
   aging, we analyzed NAD-RNA profiles from all age groups. Interestingly,
   even though NAD levels in circulating blood exhibited substantial
   inter-individual variations with advancing age,[112]^22 we found that
   the number of NAD capping events tended to increase in aged human
   subjects ([113]Figure 5B). To dissect this observation, we grouped
   age-associated trajectories into two major clusters using hierarchical
   clustering ([114]Figure 5C and [115]Table S6). For genes in each
   cluster, the age-associated dynamic of their transcript abundances and
   NAD modification levels did not exhibit positive correlations,
   suggesting that the transcriptome and NAD-modified epitranscriptome
   entailed distinct facets of the aging process ([116]Figures S3A and
   S3B). Furthermore, increased NAD modification was found for genes in
   cluster 1, with their function being involved in basic cellular events
   and adaptive immune response. NAD capping of genes associated with
   aminoacyl-tRNA biosynthesis and cellular respiration, which were
   ascribed as cluster 2, remained generally constant with age
   ([117]Figure 5C and [118]Table S7).

Figure 5.

   [119]Figure 5
   [120]Open in a new tab

   enONE reveals dynamics of NAD-capped RNAs during aging

   (A) Left: diagram illustrating the Jensen-Shannon distance. The
   Jensen-Shannon distance was used to quantify the dissimilarity between
   an individual’s profile and their corresponding reference profile,
   represented by the centroid. Right: scatterplot showing the
   relationship between age and the Jensen-Shannon distance from
   transcriptomic or epitranscriptomic profiles, respectively. The R
   represented the Pearson’s correlation coefficients and the p values
   were obtained from two-sided Pearson correlation test.

   (B) Heatmap showing epitranscriptomic profiles from individuals across
   age groups (N = 61). Top bar represented the enrichment signals
   (fold-change ≥2) of each sample.

   (C) Unsupervised hierarchical clustering was used to group NAD-RNAs
   with similar trajectory. Two major clusters were identified, and the
   top five non-redundant pathways were listed. The solid line and shaded
   region represent the smoothed trajectory and its 95% confidence
   intervals, respectively.

   (D) Heatmap showing a set of NAD-RNAs (n = 96) that highly associated
   with age.

   (E) Z-transformed, smoothed trajectories of NAD capping levels (in red)
   and expression levels (in black) of selected genes were shown. The
   solid line and shaded region represent the smoothed trajectory and its
   95% confidence intervals, respectively.

   (F) Assessment of gene-specific NAD capping by qRT-PCR. Selected
   NAD-RNAs, including age-increased NAD-RNAs (PDIA3 and HSF2), as well as
   age-decreased NAD-RNAs (UPF2 and NRGN), were examined. Data were shown
   in mean ± s.e.m. (one-sided Student’s t test).

   By inspecting the correlation between NAD modification and age, we
   identified a set of NAD-RNAs that highly associated with age (n = 96)
   ([121]Figure 5D). Specifically, select NAD-RNAs, such as those involved
   in protein folding (PDIA3), and heat shock response (HSF2) had
   increased capping with age ([122]Table S6), but the abundance at their
   RNA transcript levels was not increased ([123]Figure 5E). In addition,
   NAD capping of genes linked to mRNA decay (UPF2), and calmodulin
   binding (NRGN) were decreased during aging ([124]Figure 5E and
   [125]Table S6). The expression levels of UPF2 and NRGN also decreased
   during aging ([126]Figure 5E).

   To validate age-associated changes of NAD-RNA, we utilized a
   quantitative PCR strategy slightly modified from the CapZyme-Seq
   method.[127]^23 We introduced three types of spike-in RNAs: 1) a
   synthetic NAD-capped RNA as the positive control, 2) a synthetic
   m^7G-RNA as the negative control, and 3) a synthetic non-capped ppp-RNA
   as the baseline negative control. The RNA mixture was subjected to
   calf-intestinal alkaline phosphatase treatment to remove phosphate
   groups from the 5′ end of non-capped RNAs. Subsequently, NudC
   pyrophosphatase was introduced to hydrolyzed NAD-capped RNA, producing
   a ligatable 5′ monophosphate on these RNAs, followed by adapter
   ligation, adapter-specific cDNA synthesis, and PCR amplification
   ([128]Figure S4A). Control experiment showed that NAD-capped spike-in
   RNA, but not ppp- and m^7G-RNA, was selectively and significantly
   enriched ([129]Figure S4B). We proceeded to validate the NAD capping of
   four endogenous transcripts, including two transcripts with
   age-increased NAD modification (PDIA3 and HSF2), and two with
   age-decreased NAD capping (UPF2 and NRGN). In support of our findings
   at the epitranscriptome level, we found that NAD capping of PDIA3 and
   HSF2 was dramatically higher in aged compared to young individuals. On
   the other hand, UPF2 and NRGN had a significant decline of NAD capping
   in aged individuals ([130]Figure 5F).

RNA-based prediction of chronological age

   Aging is a complex process that entails both functional and molecular
   changes. Currently, RNA-based aging clock solely relies on the
   longitudinal change of select RNA transcripts, which limits the
   accuracy of prediction.[131]^24 Given the fact that NAD-RNAs
   intrinsically link metabolite with gene expression, we reasoned that
   RNA-based clock, if combined with an additional layer of NAD-modified
   epitranscriptome, might substantially improve the performance of
   prediction. To do this, we randomly split the aging cohort into two
   groups: the discovery cohort (N = 35), which served as the training
   set, and the validation cohort (N = 26), which was used for model
   validation. Both cohorts were balanced for ages and gender distribution
   ([132]Table S1). The performance of each model was subsequently
   evaluated by the median absolute error (MAE; unit, years) with
   leave-one-out cross-validation ([133]Figure 6A). We found that the
   transcriptome-based model (MAE = 7.1; Pearson’s R = 0.81 and p =
   5.5e-9) and epitranscriptome-based model (MAE = 5.2; Pearson’s R = 0.81
   and p = 3.9e-9) could generate prediction of chronological age. We
   further combined signature from transcriptome and epitranscriptome,
   yielding an improved prediction model (MAE = 4.2; Pearson’s R = 0.90
   and p = 1.4e-13) ([134]Figure 6B and [135]Tables S8, [136]S9, and
   [137]S10). To assess whether the trained model presented a
   dataset-specific bias, the combined model was applied to an independent
   validation cohort (N = 26). We obtained a comparable result from the
   validation cohort (MAE = 5.7; Pearson’s R = 0.81 and p = 4.5e-7),
   suggesting the model can be generalizable to the independent dataset
   ([138]Figure 6C). By employing machine learning approaches, we
   confirmed that our aging clock, which combined both transcriptomic and
   epitranscriptomic signatures, exhibited better performance compared to
   the models solely based on the transcriptome or epitranscriptome
   ([139]Figures S5A–S5F). These findings suggested that RNA-based aging
   clock with an additional epitranscriptomic layer could lead to improved
   age prediction accuracy. We assessed our model using elastic net
   regression ([140]Figures 6B and 6C), which showed outperformed result
   than models based on ridge, lasso, and random forest regression
   ([141]Figures S5A–S5F). Specifically, the features of combined aging
   clock were comprised of 20 transcriptomic and 11 epitranscriptomic
   signatures ([142]Figure 6D). 16 of them positively correlated with the
   progression of age, such as transcriptomic signatures involved in
   innate immune response (ANXA1), cell cycle (ZC3H12D), and apoptosis
   (VRK2). On the other hand, 15 features negatively associated with the
   progression of age, such as transcriptomic signatures involved in RNA
   processing (SRRM1), GPCR pathway (ARHGEF4), and epitranscriptomic
   signatures involved in mRNA decay (UPF2) ([143]Figure 6E). Moreover, we
   defined the age gap as subtracting real chronological age of human
   subjects from predicted age based on the combined aging clock, to
   investigate whether the accelerated (age gap >0) or decelerated
   biological ages (age gap <0) were associated with distinct
   physiological conditions. Using correlation analysis, we found that the
   advanced biological age was significantly associated with lower levels
   of blood cell count, such as neutrophil, white blood cell, and
   basophil, while positively correlated with blood cholesterol levels,
   including total cholesterol, and low-density lipoprotein
   ([144]Figure 6F and [145]Table S12). Together, these data highlight the
   potential of RNA-based aging clock for the prediction of chronological
   age and perhaps the physiological condition of individual subjects.

Figure 6.

   [146]Figure 6
   [147]Open in a new tab

   RNA-based age prediction model

   (A) Illustration of the development and validation of age prediction
   model.

   (B) Performance of age prediction models based on transcriptome (left),
   NAD-modified epitranscriptome (middle) and the combination of
   transcriptome and NAD-modified epitranscriptome (right) in the
   discovery cohort.

   (C) Performance of the combined age prediction model in the validation
   cohort. Each dot represents an individual. The blue solid line and
   shaded region represent a regression line and its 95% confidence
   intervals, respectively. The black dashed line represents the identity
   line. N is the number of individuals and n is the number of features.
   Numbers of epitranscriptomic features were colored in red.

   (D) Coefficients of features in the RNA-based aging clock that combined
   signatures from transcriptome and epitranscriptome.

   (E) Boxplot showing the age-associated trend of features contributing
   the most variance to the combined aging clock. Epitranscriptomic
   features were colored in red.

   (F) Age gap (predicted age – chronological age) was strongly associated
   with the blood cell count and blood cholesterol levels (|R[s]| > 0.25).
   p values were obtained with two-sided Spearman’s correlation test
   (∗p < 0.05, ∗∗p < 0.01).

   Combined, our study revealed the first NAD-modified epitranscriptome
   from human PBMCs. Using enONE, we were able to pinpoint
   epitranscriptomic alteration of NAD capping during the aging process.
   With a machine learning approach, we build an accurate RNA-based aging
   clock combing signatures from both transcriptome and epitranscriptome,
   which could predict the chronological age.

Discussion

   Incorporation of NAD, the hub metabolite and redox cofactor for cells,
   into RNA represents the crosstalk between metabolite and gene
   expression, defining a critical layer of epitranscriptomic regulation.
   NAD-RNA sequencing technologies, particularly our recently developed
   ONE-seq,[148]^8 have substantially facilitated the
   epitranscriptome-wide identification of novel NAD-capped RNAs across
   phyla.[149]^16 However, since these methods require affinity-based
   enrichment, data normalization simply borrowed from bulk RNA-seq could
   not be readily applied. In this study, we devise a general approach for
   epitranscriptome-wide NAD-RNA-seq data normalization and evaluation.
   Previous epitranscriptome profiling approaches have added low amount of
   ERCC spike-ins to account for the unwanted variation. However, ERCC
   spike-ins are only composed of 92 exogenous transcripts with length
   ranging from 250 to 2,000 nt, which cannot faithfully mimic the
   endogenous transcripts throughout stepwise enrichment procedures and
   library preparation.[150]^14^,[151]^25 Since NAD capping is an
   evolutionary conserved mechanism,[152]^16 our strategy deliberately
   includes Drosophila total RNA that comprises full spectrum of NAD
   epitranscriptome as spike-in. Particularly, we introduce relatively
   higher amount of fly RNA than that of human as surrogate for RNA loss
   resulted from affinity-based enrichment procedures. Since Drosophila is
   evolutionarily apart from human,[153]^26 sequencing reads can be easily
   distinguished in the subsequent data analysis. Although scaling-based
   approaches have been extensively used in the analysis of
   epitranscriptome profiles, we demonstrate that integration of global
   scaling and RUV strategy can optimize normalization performance (see
   [154]Figure S2). Selection of the “best” normalization, however, might
   not be feasible in practice due to the subjective definition of
   optimality.[155]^27^,[156]^28 Therefore, enONE emphasizes the choice of
   an appropriate rather than the “best” strategy. Collectively, enONE can
   efficiently remove the impact of unwanted noises arising from
   affinity-based enrichment in the heterogeneous human aging data while
   preserving the age-associated signals. Also, we highlight the
   application that enONE, an open-source R package, can be extended into
   the analysis of other types of epitranscriptomic sequencing data that
   involve stepwise enrichment procedures, e.g., m^6A-seq.

   The enONE computational framework facilitates the identification of
   NAD-RNA biomarkers. As a source of liquid biopsy, peripheral blood can
   serve as sentinel tissue to monitor individual health in a non-invasive
   manner. Identification of novel blood-derived features may open up new
   avenues in detection of biomarkers for various physiological and
   pathological conditions. Here, we reveal prominent features of NAD-RNAs
   from human PBMCs. Large collections of NAD-RNAs are produced by
   protein-encoding genes, with their biological functions mainly involved
   in basic cellular events, such as translation, RNA metabolism, and
   transcription. Additionally, genes with specialized function, such as
   those involved in immune system, are found to be capped by NAD. Thus
   far, yet the function of NAD-RNAs remains elusive, their dynamic
   changes may provide insights into how individual subjects respond to
   perturbations, such as the aging process. In line with the decline of
   NAD metabolite,[157]^29^,[158]^30 our recent mouse liver study has
   shown that the extent of NAD capping tends to decrease with age.[159]^8
   However, in contrast to biological replicates from a controlled
   laboratory environment, this pattern appears to be much more complex in
   individual human subjects. A recent study based on a large Chinese
   cohort (n = 1,518) has shown that peripheral blood NAD levels
   demonstrate remarkable inter-individual heterogeneity and
   gender-specific variation during aging.[160]^22 In the present study,
   we propose that NAD capping in PBMCs tends to increase in elder
   population (see [161]Figure 5B). This finding suggests that the
   addition or removal of 5′-terminus NAD moiety in different tissues
   might not be solely dependent on the cellular reserve of NAD. In
   addition, NAD-modified epitranscriptome might be remodeled by other
   mechanisms during aging, such as the activity of NAD-RNA decapping
   enzymes[162]^31 and small RNAs that destabilize NAD-capped
   transcripts.[163]^32 Moreover, our data illustrate some of the
   age-associated NAD-RNAs that are involved in molecular pathways
   profoundly impinging on the hallmarks of aging, such as mitochondrial
   function and immune system.[164]^33 During aging, mitochondrial
   function progressively deteriorates, compromising the homeostasis of
   energy metabolism and redox status.[165]^33 Previous studies have
   reported that mitochondrial genome can produce NAD-RNAs in various
   eukaryotic organisms, including yeast,[166]^34 Arabidopsis,[167]^35 and
   human.[168]^36 Here, our data reveal a substantial proportion of
   transcripts, including 30 out of the 37 mitochondria-encoded
   transcripts and 81 mitochondria-related genes from nucleus, can
   generate NAD-capped RNAs. Among them, we note an age-associated
   increase in NAD capping of transcripts associated with essential
   mitochondrial functions, such as oxidative phosphorylation. These
   findings suggest that NAD capping may provide novel insights into
   mechanisms that modulate mitochondrial activities with age. In
   addition, increased inflammation, characterized by an elevation of
   circulating inflammatory factors, is another key hallmark of
   aging.[169]^33 Our data highlight that the increase of NAD modification
   of genes related to adaptive immune system represents a new feature in
   the system-wide manifestation of inflammation during human aging.

   Our preliminary study provides proof-of-concept evidence that
   blood-derived, RNA-based measures can be used to predict the
   chronological age of human. Additionally, our new RNA-based aging
   clock, which combines an additional layer of NAD-modified
   epitranscriptome, exhibits a better age prediction performance than
   those solely based on the abundance of transcripts, e.g., bulk
   RNA-seq[170]^37 and recently single-cell RNA-seq.[171]^38 Since there
   is currently no gold standard measure of biological aging,[172]^39 it
   is tempting to exploit how NAD-capped RNAs as biological implications
   might signify physiological and perhaps pathological conditions.

   Taken together, we propose enONE as a flexible, modular, and general
   framework for the spike-in-based normalization and evaluation of
   NAD-RNA sequencing data in a wide spectrum of biological contexts. To
   the best of our knowledge, enONE is the first computational approach
   for NAD-modified epitranscriptome profiles, which can be extended into
   the analysis of other types of epitranscriptomic sequencing data that
   involve stepwise enrichment procedures, e.g., m^6A-seq. Using enONE, we
   identify hundreds of NAD-capped transcripts from human peripheral blood
   and reveal age-associated dynamics of NAD capping that are linked to
   biological processes impinging on aging hallmarks. Potentially,
   blood-derived NAD-RNAs can be exploited as non-invasive biomarkers for
   aging.

Limitations of the study

   The present study mainly focusses on disentangling the age-associated
   features of NAD-capped epitranscriptome from human peripheral blood
   using enONE. However, the overall number of NAD-RNAs from individual
   subjects is not simply correlated with age (see [173]Figure 5B),
   raising the possibility that additional age-associated features remain
   to be interrogated. For instance, gender effect has been shown to
   impact the fundamental metabolism[174]^40 and gene expression
   programs.[175]^41^,[176]^42 How NAD-capped RNAs, which represent the
   crosstalk between the metabolome and transcriptome, vary between
   different genders warrants future investigation. In addition, our human
   aging cohort is a relatively small-scale population recruited from the
   local community of Shanghai, which does not represent the general
   population. That how this set of RNA biomarkers compares with existing
   DNA-based quantification is worth of future investigation in a larger
   and more diverse cohort.[177]^43

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Biological samples
     __________________________________________________________________

   D. melanogaster: 5905 Bloomington Drosophila Stock Center RRID:
   BDSC_5905
     __________________________________________________________________

   Chemicals, peptides, and recombinant proteins
     __________________________________________________________________

   yDcps New England Biolabs M0463S
   Glycoblue Thermo Fisher Scientific AM9516
   RRI Takara Bio 2313A
   TRizol Takara Bio 9109
   CIAP Thermo Fisher Scientific 18009–027
   NudC Pyrophosphatase New England Biolabs M0607S
   T4 RNA Ligase 1 (ssRNA Ligase) New England Biolabs M0204S
     __________________________________________________________________

   Critical commercial assays
     __________________________________________________________________

   Fast RNA-seq Lib Prep Kit V2 Abclonal RK20306
   ABScript II cDNA First-Strand Synthesis Kit Abclonal RK20400
   ABScript III RT Master Mix for qPCR Abclonal RK20428
   MEGAscript T7 Transcription Kit Thermo Fisher Scientific AM1334
     __________________________________________________________________

   Deposited data
     __________________________________________________________________

   Raw and analyzed data Gene Expression Omnibus [178]GSE226636
     __________________________________________________________________

   Experimental models: Organisms/strains
     __________________________________________________________________

   D. melanogaster: 5905 Bloomington Drosophila Stock Center RRID:
   BDSC_5905
     __________________________________________________________________

   Oligonucleotides
     __________________________________________________________________

   List of primers and other oligonucleotides, see [179]Table S13 This
   paper N/A
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   enONE This paper [180]https://github.com/thereallda/enONE
   [181]Open in a new tab

Resource availability

Lead contact

   Additional information and requests for resources and reagents used in
   this study should be directed to and will be fulfilled by the lead
   contact, Nan Liu, (liunan@sioc.ac.cn).

Materials availability

     * This study did not generate new unique reagents.

Data and code availability

     * •
       All high-throughput RNA sequencing data as well as transcript
       quantifications have been deposited at the Gene Expression Omnibus
       under accession number [182]GSE226636
       .
     * •
       All original code has been deposited at Zenodo
       ([183]https://doi.org/10.5281/zenodo.10068529). enONE is available
       as an R package on GitHub
       ([184]https://github.com/thereallda/enONE). The datasets and the
       code are publicly available as of the date of publications.
     * •
       Any additional information required to reanalyze the data reported
       in this paper is available from the [185]lead contact upon request.

Experimental model and study participant details

   The project was approved by the Medical Ethics Committee of Shanghai
   Changzheng Hospital (2022SL006). Informed consent was obtained from all
   subjects in accordance with the local research Ethics Committee
   guidelines. We conducted a cross-sectional study of natural aging in
   community subjects recruited from Shanghai Chang Zheng Hospital. The
   aging cohort consists of 31 females and 30 males, aged from 23 to 67
   years old. Details of participant information were listed in
   [186]Table S1. Inclusion criteria for cohort were: 1) above 20 years
   old; 2) independently able to provide written informed consent. For
   comparative analysis of age-related subgroups, the cohort was divided
   into three age groups: Young (20–35 years old), Middle (36–50 years
   old), and Old (51 years old and above). Height and weight are measured
   by trained staff following standardized protocols. BMI (kg/m^2) is
   derived from the calculation and stored for further analyses. After
   5 minutes of rest in the seated position, blood pressure (mmHg) was
   measured three times with an automatic sphygmomanometer and the mean of
   the measurements was used for analysis. Blood samples were taken from
   all patients in the morning after they had been seated for 5 minutes.
   All participants had blood drawn using lithium heparin tubes (BD
   Vacutainer, catalog: 367884) by phlebotomists and consented to having
   their de-identified survey data made publicly available.

Method details

Isolation of peripheral blood mononuclear cells (PBMCs) from whole blood

   For isolation of peripheral blood mononuclear cells (PBMCs), 5 mL whole
   blood was mixed with 630 μL OptiPrep (Sigma-Aldrich, catalog: D1556)
   and 500 μL solution C (0.85% (w/x) NaCl and 10 mM Tris-HCl, pH 7.4),
   followed by centrifugation at 4°C and 1,300 g for 30 min. PBMCs were
   collected and mixed with two volumes of solution C, followed by
   centrifugation at 4°C and 500 g for 10 min. Cell pellet was washed
   twice with 1 mL PBS. The suspension was used for NAD-RNA detection.

In vitro transcription of NAD-RNA, and m^7G-RNA

   To assess the sensitivity of enrichment procedures, spike-in NAD-RNA
   (500 nt; sequence A) and m^7G-RNA (500 nt; sequence A) with identical
   sequence, oligonucleotide without adenine was synthesized (Genewiz) and
   were subjected to polyadenylation for poly(A) tails elongation
   ([187]Table S13). To assess the specificity of enrichment procedures,
   spike-in m^7G-RNA (500 nt; sequence B) oligonucleotide was synthesized
   (Genewiz) and were subjected to polyadenylation for poly(A) tails
   elongation ([188]Table S13). To prepare the baseline control for qPCR,
   spike-in ppp-RNA (500 nt) oligonucleotide was synthesized (Genewiz) and
   were subjected to polyadenylation for poly(A) tails elongation
   ([189]Table S13). To prepare the positive control for qPCR, spike-in
   NAD-RNA (500 nt) oligonucleotide was synthesized (Genewiz) and were
   subjected to polyadenylation for poly(A) tails elongation
   ([190]Table S13). For in vitro transcription, 10 μM of double-stranded
   DNA (dsDNA) template in 100 μL transcription buffer (Promega, catalog:
   P1300), along with 1 mM of each of GTP, CTP and UTP, with 4 mM NAD (for
   NAD-RNA) or 4 mM m^7GpppA (New England Biolabs, catalog: S1406S) (for
   m^7G-RNA), 10 μL of T7 RNA polymerase (Promega, catalog: P1300), 5%
   DMSO, 5 mM DTT and 2.5-unit RNase inhibitor were added and the
   transcription mixture was incubated at 37°C for 4 h. The reaction was
   incubated with 11-unit DNase I (Promega, catalog: P1300) at 37°C for
   30 min to remove the DNA template. RNA was then extracted using acid
   phenol/chloroform and precipitated with isopropanol (with 0.3 M sodium
   acetate, pH 5.5) at −80°C overnight. The RNA pellet was washed twice
   with 75% ethanol, air-dried, re-dissolved in DEPC-treated H[2]O, and
   stored at −80°C.

NAD-capped RNA sequencing

   Total RNAs from human PBMCs and total RNAs from Drosophila (spike-in)
   were prepared in accordance with the manufacturer’s instruction (Takara
   Bio, catalog: 9108). Total RNAs (10 μg) from human PBMCs were mixed
   with 40 μg Drosophila RNA (spike-in 1), 0.1 ng synthetic RNAs (spike-in
   2: 5% NAD-RNA/95% m^7G-RNA; sequence A), and 0.1 ng synthetic RNAs
   (spike-in 3: 100% m^7G-RNA; sequence B). The mixture of total RNAs and
   spike-in RNAs was incubated with 100 mM HEEB (1 M stock in DMSO) with
   ADPRC (25 μg/mL) in 100 μL of ADPRC reaction buffer (50 mM Na-HEPES pH
   7.0, 5 mM MgCl[2]) at 37°C for 1 h. 100 μL of DEPC-treated H[2]O was
   then added and acid phenol/ether extraction was performed to stop the
   reaction. RNAs were precipitated by ethanol and re-dissolved in 100 μL
   of DEPC-treated H[2]O. 5 μL of biotinylated RNAs were kept as input.
   After HEEB reaction, biotinylated RNAs were incubated with streptavidin
   bead particles (6 μL, MedChemExpress, catalog: HY-K0208) and 0.4 U/μL
   of RNase Inhibitor (Takara Bio, catalog: 2313B) at 25°C for 30 min.
   Beads were washed four times with streptavidin wash buffer (50 mM
   Tris-HCl (pH 7.4) and 8 M urea), and three times with DEPC-treated
   H[2]O. To ensure complete elution, biotin-conjugated RNAs were replaced
   from streptavidin beads by incubating with 1 mM biotin buffer (20 μL
   NaOH pH 8.0, Sigma-Aldrich, catalog: B4639) at 94°C for 8 min, followed
   by incubation with 500 nM NudC (New England Biolabs, catalog: M0607S)
   in 25 μL of NudC reaction buffer (100 mM NaCl, 50 mM Tris-HCl pH 7.9,
   10 mM MgCl[2], 100 μg/ml Recombinant Albumin) at 37°C for 30 min. After
   NudC treatment, biotinylated-RNAs that are resistant to NudC catalysis,
   potentially derived from contaminating m^7G-RNAs, were retained on
   beads by incubation with high-capacity streptavidin particle (20 μL,
   Thermo Fisher Scientific, catalog: 20357) at 25°C for 30 min. Eluted
   RNAs in the supernatant were used for next step. Input (see above) and
   NudC-eluted RNAs were used for NGS library construction, in accordance
   with the manufacturer’s instructions (mRNA-seq Lib Prep Kit for
   Illumina, Abclonal, catalog: RK20302). Library quality was assessed by
   Bioanalyzer 2100 (Agilent, United States), and quantification was
   performed by qRT-PCR with a reference to a standard library. Libraries
   were pooled together in equimolar amounts to a final 2 nM concentration
   and denatured with 0.1 M NaOH (Sigma, catalog: 72068). Libraries were
   sequenced on the Illumina NovaSeq 6000 system (paired end; 150 bp).

Validation of NAD-capped RNA by a modified CapZyme-Seq method

   Total RNAs (3 μg) from human PBMCs of young (n = 3) and old (n = 3)
   individuals were mixed with Drosophila total RNA (7 μg), as surrogate
   for RNA loss, and polyadenylated spike-in RNAs generated from different
   sequence templates, including 0.1 ng NAD-RNA (500 nt), 0.1 ng
   m^7Gppp-RNA (500 nt), and 0.1 ng ppp-RNA (500 nt). RNAs were incubated
   with 10 U CIAP (Thermo Fisher Scientific, catalog: 18009019) in the
   presence of 40 U RNaseOUT at 37°C for 1 h to remove 5′-terminal
   phosphate from non-capped RNAs. After purification with RNAiso Plus
   (Takara Bio, catalog: 9108), CIAP-treated RNAs were incubated with
   250 nM NudC (New England Biolabs, catalog: M0607S) in 40 μl of NudC
   reaction buffer (100 mM NaCl, 50 mM Tris–HCl pH 7.9, 10 mM MgCl2,
   100 μg/ml Recombinant Albumin, 40 U RNaseOUT) at 37°C for 30 min. This
   step aims to hydrolyze the cap from NAD-RNA resulting in ligatible
   monophosphate at the 5′ end. Products were purified with RNAiso Plus
   (Takara Bio, catalog: 9108) and further ligated with 100 μM 5′ adaptor
   oligo listed in [191]Table S13, in the presence of 15 U T4 RNA ligase 1
   (New England Biolabs, catalog: M0202), 15% of PEG8000, 1 mM ATP and
   40 U RNaseOUT in 40 μl of 1× T4 RNA ligase buffer at 16°C for 16 h.
   RNAs were purified and re-dissolved in 20 μl of DEPC-treated H[2]O.
   Reverse transcription was performed with designed primer listed in
   [192]Table S13, by ABScript II cDNA First-Strand Synthesis Kit
   (ABclonal, catalog: RK20400). 10 μl of products was diluted to 200 μl
   and served as input. Meanwhile, 5 μl of reversely transcribed products
   were further amplified by PCR using adaptor-specific primers listed in
   [193]Table S13. The PCR program operated with an initial denaturation
   step of 3 min at 95°C, amplification for 14 cycles (denaturation for
   15 s at 95°C, annealing for 15 s at 60°C and extension for 6 min at
   72°C), and a final extension for 5 min at 72°C. The ppp-RNA (500 nt)
   was used as the baseline control, with NAD-RNA (500 nt) serving as the
   positive control, and m^7G-RNA (500 nt) serving as the negative
   control. Primers were listed in [194]Table S13. To calculate the
   enrichment from qRT-PCR data, the Ct value of the target gene was first
   normalized to the Ct of the ppp-RNA (baseline). Next, the normalized
   PCR+ fraction value (ΔCt of the target gene normalized to the ppp-RNA)
   was normalized to the background (ΔCt calculation for the gene in the
   input), to yield the ΔΔCt value. The linear conversion of this ΔΔCt
   resulted in the fold enrichment. Significance was assessed by Student’s
   t test.

Quantification and statistical analysis

High-throughput sequencing data analysis

   All sequencing reads were processed with Trim Galore (v0.6.6)[195]^44
   with the parameters “--nextseq 30 --paired” to remove the adapter
   sequences (AGATCGGAAGAGC) from NovaSeq-platforms, and reads longer than
   20 bp were kept. Reads that passed the quality control procedure were
   kept and mapped to the combined genome of Homo sapiens (GRCh38) and
   Drosophila melanogaster (dmel-all-chromosome-r6.36) using STAR
   (v2.7.6a)[196]^45 with default parameters. Uniquely mapped read pairs
   were counted against annotations from Homo sapiens (Ensembl: GRCh38.94)
   and Drosophila melanogaster (Flybase: dmel-all-r6.36) using
   featureCounts (v2.0.1)[197]^46 with parameters “-p -B -C” and
   summarized as gene-level counts. Sequencing saturation was assessed by
   randomly subsampling the original libraries and examined the
   corresponding changes in the number of genes, derived from human
   genome, with more than 10 read counts.

enONE workflow

   enONE is implemented in R and publicly available at
   [198]https://github.com/thereallda/enONE. enONE workflow consists of
   four steps: 1. Quality control; 2. Gene set selection; 3. Normalization
   procedures; 4. Normalization performance assessment. By “log”
   transformation, we generally refer to the log[2](x+1) function unless
   otherwise stated. Below, steps are shown in detail.

Quality control

   The goal of quality control was to remove problematic or noisy
   observations from downstream analysis. In this study, we used sample
   and gene filtering to control data quality. To assess outliers, we
   applied Rosner’s outlier test on principal component 1. Principle
   component analysis (PCA) was performed with prcomp function on the top
   20,000 genes based on a transformed counts matrix by vst function from
   R package DESeq2 (v1.36.0).[199]^13 All samples were kept for
   subsequent analysis. To keep well-detected genes across samples, we
   used filterByExpr function from the R package edgeR (v3.38.4)[200]^47
   with parameter “min.count=20”. All ribosomal RNA encoded genes and TEC
   genes were excluded.

Gene set selection

   enONE defined three sets of control genes for adjustment of the
   unwanted variations, evaluation of the unwanted variations, and
   evaluation of the wanted variations, respectively. For adjustment of
   the unwanted variation, we defined the 1,000 least significantly
   enriched genes (denoted as anchor set) in Drosophila spike-ins, ranked
   by likelihood ratio, as negative control genes. Since the effect of
   affinity-based enrichment was the known covariate of interest, we used
   these genes, that were not affected by the enrichment effect, to
   compute the unwanted variation in the subsequent RUV procedure. For
   evaluation of the unwanted variation, we defined the 500 least
   significantly varied genes in human, ranked by likelihood ratio, as
   negative evaluation genes. We performed differential analysis test
   across all covariates of interest to determine genes with constant
   expression levels. Since the variation of constant genes could reflect
   the handling effects, those genes could be used to evaluate the removal
   of unwanted variation in the subsequent normalization evaluation step.
   For evaluation of the wanted variation, we defined the 500 most
   significantly enriched genes in human, ranked by likelihood ratio, as
   positive evaluation genes. We performed differential analysis test
   between enrichment and input samples in human to determine genes
   affected by enrichment procedures. Those genes were used to evaluate
   the preservation of wanted variation in the subsequent normalization
   evaluation step.

Normalization procedures

   enONE implemented global scaling and regression-based methods for the
   generation of two-part normalization schemes. For the global scaling
   normalization methods, gene-level read counts were adjusted by a
   sample-wise scale factor. This procedure encompassed five distinct
   scaling methods, including: 1) Total Count (TC): The scale factor was
   defined as the sum of the read counts across all genes within a sample.
   2) Upper-Quartile (UQ): The scale factor was computed as the
   upper-quartile of the gene-level count distribution for each sample. 3)
   TMM[201]^12: The scale factor was determined using a robust estimation
   of the overall expression fold change between a given sample and a
   reference sample. TMM was performed by the calcNormFactors function
   from R package edgeR with parameter: method = “TMM”. The default
   criterion for reference sample selection was based on its upper
   quartile being the closest to the mean upper quartile value across all
   samples. 4) DESeq[202]^13: The scale factor for a given sample was
   calculated as the median fold-change between the samples and a
   pseudo-reference sample, whose counts were defined as the geometric
   means of the counts across samples. This method was performed by the
   calcNormFactors function from R package edgeR with parameter: method =
   “RLE”. Note that the method discarded any gene having zero count in at
   least one sample. 5) PoissonSeq: The scale factor was derived based on
   an iterative estimate of sequencing depth, considering a subset of
   genes characterized by Poisson goodness-of-fit statistics.[203]^17 By
   default, this method discarded genes with less than 5 read counts.

   For the regression-based procedures, enONE considered the following
   generalized linear model, which allows adjustment for factors of
   unwanted variation:
   [MATH: <mrow><mi>Y</mi><mo
   linebreak="badbreak">=</mo><mrow><mi>X</mi><mi>β</mi></mrow><mo
   linebreak="goodbreak">+</mo><mrow><mi>W</mi><mi>α</mi></mrow><mo
   linebreak="goodbreak">+</mo><mi>O</mi><mtext>,</mtext></mrow> :MATH]
   (Equation 1)

   where Y was the n x j matrix of gene-level read counts, X was an n x m
   design matrix corresponding to the m covariates of interest of ‘‘wanted
   variation’’ (e.g., treatment) and β its associated m x j matrix of
   parameters of interest, W was an n x k matrix corresponding to unknown
   factors of unwanted variation and α its associated k x j matrix of
   nuisance parameters, O was an n x j matrix of offsets that can either
   be set to zero or estimated with other normalization procedure (such as
   TMM normalization). Here, n is the number of genes, j is the number of
   samples, m is the number of covariates, and k is the chosen number of
   unwanted variation factors.

   The unwanted variation W was estimated via singular value decomposition
   (SVD) using approaches implemented in R package RUVSeq
   (v1.30.0).[204]^14 In this study, we introduced three variants of RUV,
   including: 1) RUVg: This variant estimated the unwanted variation
   factors based on negative control genes that were assumed to exhibit
   constant expression across all samples (β = 0). 2) RUVs: This method
   estimated the factors of unwanted variation based on negative control
   genes derived from the replicate samples, such as those within each
   treatment group, where the covariates of interest remained constant
   (β = 0). 3) RUVse: This novel variant was a modified version of RUVs.
   It estimated the unwanted variation factors using negative control
   genes from the replicate samples within each assay group (i.e.,
   enrichment and input), with the assumption that the enrichment effect
   remained constant. Following the estimation of unwanted variations, for
   a chosen value of k (1 ≤ k ≤ j), the adjusted read counts Y∗ can be
   defined by
   [MATH: <mrow><msup><mi>Y</mi><mo>∗</mo></msup><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mi>Y</mi><mo>−</mo><msup><mover
   accent="true"><mi>W</mi><mo>ˆ</mo></mover><mrow><mo>(</mo><mi>k</mi><mo
   >)</mo></mrow></msup><msup><mover
   accent="true"><mi>α</mi><mo>ˆ</mo></mover><mrow><mo>(</mo><mi>k</mi><mo
   >)</mo></mrow></msup><mo>−</mo><mi>O</mi><mtext>.</mtext></mrow> :MATH]

Normalization performance evaluation

   To evaluate the performance of normalization, we leveraged eight
   normalization performance metrics that related to different aspects of
   gene expression measures.[205]^27

   The following four metrics evaluated normalization procedures based on
   how well the samples can be clustered according to factors of wanted or
   unwanted variation. Clustering by wanted factors was desirable, while
   clustering by unwanted factors was undesirable. We used silhouette
   widths as clustering measurements. For any clustering of n samples, the
   silhouette width of sample i was defined as:
   [MATH:
   <mrow><mrow><mi>s</mi><mi>i</mi><mi>l</mi></mrow><mrow><mo>(</mo><mi>i<
   /mi><mo>)</mo></mrow><mo
   linebreak="badbreak">=</mo><mfrac><mrow><mi>b</mi><mrow><mo>(</mo><mi>i
   </mi><mo>)</mo></mrow><mo>−</mo><mi>a</mi><mrow><mo>(</mo><mi>i</mi><mo
   >)</mo></mrow></mrow><mrow><mi>max</mi><mrow><mo>{</mo><mrow><mi>a</mi>
   <mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow><mo>,</mo><mi>b</mi><mrow><m
   o>(</mo><mi>i</mi><mo>)</mo></mrow></mrow><mo>}</mo></mrow></mrow></mfr
   ac><mo
   linebreak="goodbreak">∈</mo><mrow><mo>[</mo><mrow><mo>−</mo><mn>1,1</mn
   ></mrow><mo>]</mo></mrow><mtext>,</mtext></mrow> :MATH]
   (Equation 2)

   where a(i) denoted the average distance (by default, Euclidean distance
   over the first three PCs of expression measures) between the ith sample
   and all other samples in the cluster to which i was assigned and b(i)
   denoted the minimum average distance between the ith sample and samples
   in other clusters. The larger the silhouette widths, the better the
   clustering. Thus, the average silhouette width across all n samples
   provides an overall quality measure for a given clustering. Silhouette
   width was calculated with the silhouette function from R package
   cluster (v2.1.3). These metrics were defined as follows: 1) BIO_SIM:
   Group the n samples according to the value of a categorical covariate
   of interest (e.g., age group or treatment) and compute the average
   silhouette width for the resulting clustering. 2) BATCH_SIM: Group the
   n samples according to the batch and compute the average silhouette
   width for the resulting clustering. 3) EN_SIM: Group the n samples
   according to the assay (e.g., enrichment or input) and compute the
   average silhouette width for the resulting clustering. 4) PAM_SIM:
   Cluster the n samples using partitioning around medoids (PAM) for a
   range of numbers of clusters and compute the maximum average silhouette
   width for these clusters. PAM clustering was done by the pamk function
   from R package fpc (v2.2-9). Large values of BIO_SIM, EN_SIM and
   PAM_SIM and low values of BATCH_SIM were desirable.

   The next two metrics evaluated the association of log-count principal
   components (by default, the first 3 PCs) with “evaluation” principal
   components of wanted or unwanted variation. 1) UV_COR: The weighted
   coefficient of determination
   [MATH: <mrow><msup><mover
   accent="true"><mi>R</mi><mo>¯</mo></mover><mn>2</mn></msup></mrow>
   :MATH]
   for the regression of log-count principal components on factors of
   unwanted variation (UV) derived from negative evaluation genes. The
   submatrix of log-transformed unnormalized counts for negative
   evaluation genes is row-centered and scaled and factors of unwanted
   variation are defined as the right-singular vectors as computed by the
   svd function. 2) WV_COR: The weighted coefficient of determination
   [MATH: <mrow><msup><mover
   accent="true"><mi>R</mi><mo>¯</mo></mover><mn>2</mn></msup></mrow>
   :MATH]
   for the regression of log-count principal components on factors of
   wanted variation (WV) derived from positive evaluation genes. The WV
   factors were computed in the same way as the UV factors above, but with
   positive instead of negative evaluation genes. Large values of WV_COR
   and low values of UV_COR were desirable.

   The weighted coefficients of determination were computed as follows.
   For each type of evaluation criterion (i.e., UV, or WV), regressed each
   expression PC on all supplied evaluation PCs. Denoted SST[k] as the
   total sum of squares, SSR[k] as the regression sum of squares, and
   SSE[k] as the residual sum of squares for the regression for the kth
   expression PC. The coefficient of determination was defined as:
   [MATH: <mrow><msubsup><mi>R</mi><mi>k</mi><mn>2</mn></msubsup><mo
   linebreak="badbreak">=</mo><mfrac><msub><mrow><mi>S</mi><mi>S</mi><mi>R
   </mi></mrow><mi>k</mi></msub><msub><mrow><mi>S</mi><mi>S</mi><mi>T</mi>
   </mrow><mi>k</mi></msub></mfrac><mrow><mo>=</mo><mn>1</mn><mo>−</mo></m
   row><mfrac><msub><mrow><mi>S</mi><mi>S</mi><mi>E</mi></mrow><mi>k</mi><
   /msub><msub><mrow><mi>S</mi><mi>S</mi><mi>T</mi></mrow><mi>k</mi></msub
   ></mfrac><mtext>,</mtext></mrow> :MATH]
   (Equation 3)

   and weighted average coefficient of determination as:
   [MATH: <mrow><msup><mover
   accent="true"><mi>R</mi><mo>¯</mo></mover><mn>2</mn></msup><mo
   linebreak="badbreak">=</mo><mfrac><mrow><msub><mo>∑</mo><mi>k</mi></msu
   b><msub><mrow><mi>S</mi><mi>S</mi><mi>T</mi></mrow><mi>k</mi></msub><ms
   ubsup><mi>R</mi><mi>k</mi><mn>2</mn></msubsup></mrow><mrow><msub><mo>∑<
   /mo><mi>k</mi></msub><msub><mrow><mi>S</mi><mi>S</mi><mi>T</mi></mrow><
   mi>k</mi></msub></mrow></mfrac><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><msub><mo>∑</mo><mi>k</mi></ms
   ub><msub><mrow><mi>S</mi><mi>S</mi><mi>R</mi></mrow><mi>k</mi></msub></
   mrow><mrow><msub><mo>∑</mo><mi>k</mi></msub><msub><mrow><mi>S</mi><mi>S
   </mi><mi>T</mi></mrow><mi>k</mi></msub></mrow></mfrac><mrow><mo>=</mo><
   mn>1</mn><mo>−</mo></mrow><mfrac><mrow><msub><mo>∑</mo><mi>k</mi></msub
   ><msub><mrow><mi>S</mi><mi>S</mi><mi>E</mi></mrow><mi>k</mi></msub></mr
   ow><mrow><msub><mo>∑</mo><mi>k</mi></msub><msub><mrow><mi>S</mi><mi>S</
   mi><mi>T</mi></mrow><mi>k</mi></msub></mrow></mfrac><mtext>,</mtext></m
   row> :MATH]
   (Equation 4)

   The next two metrics evaluated the similarity of gene expression
   distributions. We defined gene-level relative log-expression (RLE)
   measures, as log-ratios of read counts to median read counts across
   samples, for comparing distribution of gene expression. RLE was defined
   as follows:
   [MATH:
   <mrow><msub><mrow><mi>R</mi><mi>L</mi><mi>E</mi></mrow><mrow><mi>i</mi>
   <mi>j</mi></mrow></msub><mo linebreak="badbreak">=</mo><mi
   mathvariant="italic">log</mi><mfrac><msub><mi>Y</mi><mrow><mi>i</mi><mi
   >j</mi></mrow></msub><mrow><msub><mrow><mi>M</mi><mi>e</mi><mi>d</mi><m
   i>i</mi><mi>a</mi><mi>n</mi></mrow><mi>i</mi></msub><msub><mi>Y</mi><mr
   ow><mi>i</mi><mi>j</mi></mrow></msub></mrow></mfrac><mtext>,</mtext></m
   row> :MATH]
   (Equation 5)

   for gene i in sample j. For similar distributions, the RLE should be
   centered around zero and have a similar distribution across samples.
   Thus, the metrics were defined as follows: 1) RLE_MED: Mean squared
   median RLE. 2) RLE_IQR: Variance of inter-quartile range (IQR) of RLE.
   Low values of RLE_MED and RLE_IQR were desirable.

   In the enONE framework, the expression measures were normalized
   according to a set of methods (including raw counts) and the eight
   metrics above were computed for each dataset. The performance
   assessment results can be visualized using biplots and the
   normalization procedures were ranked based on a function of the
   performance metrics. In particular, enONE defined a performance score
   by orienting the metrics (multiplying by ±1) so that large values
   correspond to good performance, ranking procedures by each metric, and
   averaging the ranks across metrics.

Comparison methods

   For benchmarking, we compared the enONE procedure with the
   regression-based method, RUVg, and the scaling-based approach, RADAR.
   For RUVg, we used the anchor set from Drosophila spike-ins as negative
   control genes and the first four estimated factors of unwanted
   variations for adjustment. For RADAR, we normalized the input libraries
   using the median-of-ratio method of DESeq based on input gene-level
   counts. To normalize the enrichment libraries, we estimated sample-wise
   size factors from the fold change
   [MATH:
   <mrow><msub><mi>F</mi><mrow><mi>n</mi><mo>,</mo><mi>j</mi></mrow></msub
   ><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mfrac><msub><mrow><mi>e</mi><mi>n</mi></m
   row><mrow><mi>n</mi><mo>,</mo><mi>j</mi></mrow></msub><msub><mrow><mi>i
   </mi><mi>n</mi></mrow><mrow><mi>n</mi><mo>,</mo><mi>j</mi></mrow></msub
   ></mfrac></mrow> :MATH]
   of the top 1% genes ranked by enrichment read counts, where
   [MATH:
   <mrow><msub><mrow><mi>e</mi><mi>n</mi></mrow><mrow><mi>n</mi><mo>,</mo>
   <mi>j</mi></mrow></msub></mrow> :MATH]
   is enrichment read count and
   [MATH:
   <mrow><msub><mrow><mi>i</mi><mi>n</mi></mrow><mrow><mi>n</mi><mo>,</mo>
   <mi>j</mi></mrow></msub></mrow> :MATH]
   is the DESeq normalized gene counts from gene n and sample j.

Linear regression for normalization performance evaluation

   R^2 values of fitted linear models were used to quantify the strength
   of the linear relationships between a source of unwanted variation,
   i.e., batch effect, and global sample summary statistics, such as the
   first k PCs (1 ≤ k ≤ 6).[206]^48 The lm R function was used for this
   analysis.

Spearman correlation

   Spearman correlation coefficients were used to determine the monotonic
   relationships between a source of unwanted variation, i.e., batch
   effect, and the raw or normalized gene-level read counts. The cor R
   function was used for this analysis.

ANOVA

   To evaluate the effects of the different sources of effect on the data,
   ANOVA were performed. For evaluating the batch effect, ANOVA tests were
   computed across batches. Genes with p value <0.01 were deemed to be
   affected by the batch effect. The aov R function was used for this
   analysis.

Vector correlation

   The Rozeboom squared vector correlation[207]^49 was used to quantify
   the strength of relationships between the first k expression PCs (1 ≤ k
   ≤ 6) and the enrichment covariates.

Adjusted rand index

   The adjusted Rand index (ARI), a correction of the Rand Index, measures
   the similarity between two clustering.[208]^50 The ARI was used to
   evaluate the performance of normalization methods in terms of
   separation between enrichment and input samples. The first three
   expression PCs was utilized to perform ARI analysis.

Jensen-Shannon distance

   To quantify differences in gene expression and NAD-RNA profiles between
   individuals, we calculated the pairwise distance for all individuals
   and their corresponding centroid using the Jensen-Shannon
   Distance.[209]^20^,[210]^21 We first computed the centroid distribution
   P[c] by averaging the transcriptomic or epitranscriptomic profiles from
   all samples, respectively. For each individual’s transcriptomic or
   epitranscriptomic distribution, the Jensen-Shannon Divergence (JSD) can
   be calculated as:
   [MATH: <mrow><mrow><mi>J</mi><mi>S</mi><mi>D</mi></mrow><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi>P</mi><mi>c</mi></msub><mo>,</mo>
   <msub><mi>P</mi><mi>i</mi></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">=</mo><mi>H</mi><mrow><mo
   stretchy="true">(</mo><mfrac><mrow><msub><mi>P</mi><mi>c</mi></msub><mo
   linebreak="badbreak">+</mo><msub><mi>P</mi><mi>i</mi></msub></mrow><mn>
   2</mn></mfrac><mo
   stretchy="true">)</mo></mrow><mo>−</mo><mfrac><mrow><mi>H</mi><mrow><mo
   stretchy="true">(</mo><msub><mi>P</mi><mi>c</mi></msub><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">+</mo><mi>H</mi><mrow><mo
   stretchy="true">(</mo><msub><mi>P</mi><mi>i</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mn>2</mn></mfrac><mtext>,</mtext><
   /mrow> :MATH]
   (Equation 6)

   where P[i] is the distribution for individual i and H is the Shannon
   entropy function:
   [MATH:
   <mrow><mi>H</mi><mrow><mo>(</mo><mi>X</mi><mo>)</mo></mrow><mrow><mo>=<
   /mo><mo>−</mo></mrow><mrow><munderover><mo>∑</mo><mrow><mi>i</mi><mrow>
   <mo>=</mo><mn>1</mn></mrow></mrow><mi>n</mi></munderover><mrow><mi>P</m
   i><mrow><mo stretchy="true">(</mo><msub><mi>x</mi><mi>i</mi></msub><mo
   stretchy="true">)</mo></mrow><msub><mi
   mathvariant="italic">log</mi><mn>2</mn></msub><mrow><mo
   stretchy="true">(</mo><mrow><mi>P</mi><mrow><mo
   stretchy="true">(</mo><msub><mi>x</mi><mi>i</mi></msub><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">)</mo></mrow></mrow></mrow><mtext>,</mtext></mrow>
   :MATH]
   (Equation 7)

   Finally, the Jensen-Shannon Distance can be calculated as the square
   root of JSD.

Quantitative NAD-RNA-seq data analysis

   To identify NAD-RNA from sequencing data, we performed differential
   analysis using FindEnrichment function from enONE package. Significance
   of logarithmic fold changes was determined by a likelihood ratio test
   to approximate p values, and genes were adjusted for multiple testing
   using the Benjamini-Hochberg procedure to yield a false discovery rate
   (FDR). NAD-RNAs were defined as fold change of normalized transcript
   counts ≥2 and FDR <0.05 in enrichment samples compared to those in
   input samples. The fold change of normalized counts between pairwise
   enrichment and input sample was denoted as NAD modification level. Gene
   annotation information, such as chromosome, gene types, and gene
   lengths were retrieved from Ensembl (GRCh38.94) annotations. The violin
   plot, boxplot, bar plot, line chart and scatterplot were generated by R
   package ggplot2 (v3.3.6).[211]^51

RNA secondary structure analysis

   For each transcript with 5′ untranslated region (UTR), the sequence of
   100 bp region downstream of transcription start site (TSS) was used to
   calculate RNA minimum free energy (MFE) by ViennaRNA with default
   parameter.[212]^19 To generate a background distribution of MFE,
   transcripts that did not produce NAD-cap were randomly sampled 1,000
   times from the whole transcriptome with a similar size (n = 600) to
   transcripts producing NAD-cap (n = 589).

Pathway enrichment analysis

   Pathway enrichment analysis was performed using Metascape
   (v3.5).[213]^52 Pathways were defined as the molecular pathways of
   Reactome, the biological process (BP) of GO, the Kyoto Encyclopedia of
   Genes and Genomes (KEGG), the WikiPathways (WP), the canonical pathways
   of MSigDb, and the CORUM database. Metascape clustered enriched terms
   into non-redundant groups based on similarities between terms and used
   the most significant term within each cluster to represent the cluster.
   The resulting clusters of pathways were manually reviewed. Enrichment
   was tested using the hypergeometric test. Multiple hypothesis
   correction was performed with Benjamini-Hochberg procedure, and the
   significance threshold was defined as α = 0.05.

Hierarchical clustering analysis

   Hierarchical clustering with a Euclidean distance metric was performed
   for both rows (NAD-RNAs, complete method) and columns (samples, ward’s
   method) on the scaled NAD modification matrix. Clustering and
   corresponding heatmaps were generated by R package Complex Heatmap
   (v2.12.1).[214]^53 Using cutree function with “k = 2” parameter, we
   identified 3 clusters of NAD-RNAs changing with age, ranging from 201
   to 608 NAD-RNAs.

Trajectory analysis

   To estimate NAD modification trajectories during aging, log NAD
   modification levels were Z-scored, and LOESS regression was fitted for
   each gene. Similarly, log expression levels were used to estimate the
   gene expression trajectories during aging. The trajectory of clusters
   was estimated using the median levels of genes in each cluster.
   Pathways were queried as above to gain insights into the biological
   functions of each cluster. Top five non-redundant enriched terms were
   shown. To identify NAD-RNAs that correlated with age, we computed
   Spearman’s rank correlation coefficients (R[s]) for each NAD-RNAs on
   the basis of NAD modification and age. The resulting correlation
   metrics was then filtered, such that only NAD-RNAs that were
   significantly correlated with age (p < 0.05 and | R[s] | ≥ 0.3) were
   considered. Similarly, correlation between age and gene expression
   levels was computed.

Age prediction model

   We developed different age prediction models for human PBMCs based on
   NAD-modified epitranscriptome, transcriptome, and the combination of
   NAD-modified epitranscriptome and transcriptome, respectively. To
   construct those models, we employed four machine learning approaches,
   including elastic net, lasso, ridge, and random forest regression.
   Aging cohort was randomly split into the discovery cohort (N = 35) for
   model training, and the validation cohort (N = 26) for independent
   evaluation. Age prediction was performed by regressing chronological
   age on log-transformed feature levels (i.e., NAD modification and
   transcript abundances) using the discovery cohort. To build a robust
   aging clock, we pre-selected features using the age-correlated NAD-RNAs
   (p < 0.05 and | R[s] | ≥ 0.3) and transcripts (p < 0.05 and | R[s] | ≥
   0.5) as a starting point. To begin, we randomly split the discovery
   cohort into training (70%) and test (30%) sets with balanced ages.
   Model optimization including hyperparameter tuning was done by a grid
   search with leave-one-out cross-validation (LOOCV) based on training
   sets. Model performance was assessed on test sets, using several
   statistics including median absolute error (MAE), Pearson’s correlation
   coefficient and its associated p-value. Furthermore, we performed a
   cross-validation scheme for arriving the least biased estimates of the
   accuracy of the aging clocks, consisting of leaving out a single sample
   from the regression, predicting age for that sample, and iterating over
   all samples on discovery cohort. The best-tuned hyperparameters of
   trained models were listed in [215]Table S11. Above model training and
   hyperparameter tuning are performed with R packages caret (v6.0-93),
   glmnet (v4.1-4) and randomForest (v4.7–1.1).

Association of age gap index and physiological phenotypes

   Subtracting chronological age from predicted age, termed as the age
   gap, provided a normalized measure of the extent to which an individual
   appeared older (age gap >0) or younger (age gap <0) than same-aged
   peers.[216]^54 Then, we explored whether age gap was associated with
   physiological phenotypes, such as blood pressure, cholesterol levels,
   blood cell counts. Associations were evaluated using spearman’s
   correlation coefficients, while adjusting for chronological age and
   gender through linear regression. The associations of age gap and
   physiological phenotypes were listed in [217]Table S12.

Statistical analyses

   All p values reported herein were calculated using the non-parametric
   Mann-Whitney rank test unless otherwise stated.

Acknowledgments