Abstract
The metabolome includes not just known but also unknown metabolites;
however, metabolite annotation remains the bottleneck in untargeted
metabolomics. Ion mobility – mass spectrometry (IM-MS) has emerged as a
promising technology by providing multi-dimensional characterizations
of metabolites. Here, we curate an ion mobility CCS atlas, namely
AllCCS, and develop an integrated strategy for metabolite annotation
using known or unknown chemical structures. The AllCCS atlas covers
vast chemical structures with >5000 experimental CCS records and ~12
million calculated CCS values for >1.6 million small molecules. We
demonstrate the high accuracy and wide applicability of AllCCS with
median relative errors of 0.5–2% for a broad spectrum of small
molecules. AllCCS combined with in silico MS/MS spectra facilitates
multi-dimensional match and substantially improves the accuracy and
coverage of both known and unknown metabolite annotation from
biological samples. Together, AllCCS is a versatile resource that
enables confident metabolite annotation, revealing comprehensive
chemical and metabolic insights towards biological processes.
Subject terms: Metabolomics, Mass spectrometry, Databases, Software
__________________________________________________________________
Collision cross section (CCS) information can aid the annotation of
unknown metabolites. Here, the authors optimize the machine-learning
based prediction of metabolite CCS values and curate a 1.6 million
compound CCS atlas, improving annotation accuracy and coverage for
known and unknown metabolites.
Introduction
Untargeted metabolomics enables comprehensive measurements of a
significant number of metabolites in complex systems, and identifies
the accrued metabolic changes with physiological and pathological
status, such as diseases^[36]1,[37]2. Metabolites in the metabolome
include knowns and unknowns generated from biotransformation of
endogenous and exogenous compounds, and have a vast diversity of
chemical structures^[38]3. Metabolite identification remains the
central bottleneck in liquid chromatography–mass spectrometry
(LC–MS)-based untargeted metabolomics^[39]3–[40]5. The standard
strategy for metabolite identification is to match accurate mass and
tandem mass spectra (MS/MS or MS2) with standard spectral libraries
(e.g., METLIN^[41]6, MASSBANK^[42]7, and NIST) and/or in-silico
predicted MS/MS spectra^[43]8. However, standard spectral libraries
suffer from limited coverage, while in-silico prediction lacks
high accuracy^[44]4. Other bioinformatic approaches (e.g., GNPS^[45]9,
MetDNA^[46]10) also use MS2 spectra and molecular networking algorithms
for metabolite annotation. All of these strategies require unique,
high-quality experimental MS2 spectra. However, low-molecular-weight
metabolites usually have very sparse MS2 spectra and lack
characteristic product ions for structural elucidation^[47]4. Some
metabolite isomers share highly similar MS2 spectra. Many experimental
factors, such as high sample complexity, low concentration, and
co-elution of isobaric and isomeric metabolites, make it difficult to
acquire high-quality MS2 spectra^[48]4. In addition, annotation of
unknown metabolites with new chemical structures is still a challenge
in untargeted metabolomics^[49]3,[50]11. These issues cause low
coverage and a high false-positive rate in metabolite annotation,
suggesting that other physicochemical properties should be used for
metabolite annotation.
Recently, ion mobility–mass spectrometry (IM–MS) has emerged as a
promising technique for untargeted metabolomics by providing
multi-dimensional separation and high selectivity^[51]12–[52]15.
Importantly, ion mobility can rapidly separate metabolite ions based on
their differences in rotationally averaged surface area or collision
cross-section (CCS)^[53]16,[54]17. This makes it possible to distinguish
the isomeric metabolites that commonly exist in biological
samples^[55]18–[56]21. Unlike retention time (RT) and MS/MS spectra,
which are affected by many experimental factors, CCS is
highly reproducible across instruments and labs, and is much easier
to standardize^[57]16,[58]22. The IM-derived CCS value is
a unique physicochemical property that can improve the accuracy of
metabolite annotation. Significant efforts have been made to curate
large-scale experimental and calculated CCS databases^[59]23. For example,
the Baker group^[60]24 and the McLean group^[61]25,[62]26 measured chemical
standards to construct experimental CCS databases with >1000 CCS values.
Nevertheless, these CCS resources are reported in quite different
formats, and lack appropriate procedures and tools for data collection,
collation, standardization, and sharing. Our group and others developed
machine-learning-based prediction (e.g., MetCCS^[63]27,[64]28,
LipidCCS^[65]29, DeepCCS^[66]30) and quantum chemistry-based
theoretical calculation (e.g., ISiCLE^[67]31) approaches to generate
large-scale CCS values for metabolites, lipids and other compounds.
Coupling IM–MS with LC separation and data-independent or
data-dependent MS/MS techniques (e.g., MS^E, AIF, and PASEF) enables
simultaneous acquisition of four-dimensional metabolomics data within
one analysis, including MS1, RT, CCS, and MS/MS^[68]32,[69]33. However,
few studies have integrated the multi-dimensional properties from IM–MS
for the large-scale annotation of both known and unknown metabolites
in untargeted metabolomics^[70]11.
Here, we curate an ion mobility CCS atlas, namely, AllCCS, to embrace
both experimental and predicted CCS values, and develop an integrated
multi-dimensional match strategy to enable annotation of both known and
unknown metabolites in IM–MS-based untargeted metabolomics
(Fig. [71]1). The AllCCS atlas includes >5000 experimental CCS records
and ~12 million predicted CCS values for >1.6 million compounds. The
newly optimized machine-learning-based prediction utilizes a large
training dataset with high diversity of chemical structures and a
representative structure similarity (RSS) score to evaluate the
accuracy of predicted CCS values. Our data show that AllCCS
outperforms other CCS calculation tools in terms of both coverage and
accuracy. We further demonstrate that the use of the AllCCS atlas and/or
in-silico MS/MS spectra improves annotation performance for both
known and unknown metabolites in untargeted metabolomics. Taken
together, the AllCCS atlas is a valuable and unique resource to support
IM–MS-based multi-dimensional metabolomics. It helps expand
the chemical coverage of annotation and extend the assessment of
metabolic pathways and activities, further revealing comprehensive
chemical and metabolic insights into biological processes.
Fig. 1. Overview of AllCCS atlas and annotation of known and unknown
metabolites.
The AllCCS atlas hosts 5119 experimental CCS records, 3539 unified CCS
values, and ~12 million predicted CCS values for ~1.7 million
compounds. AllCCS can be integrated with in-silico MS/MS spectra to
enable the multi-dimensional annotation for known and unknown
metabolites in untargeted metabolomics.
Results
Unified AllCCS database
To curate the ion mobility CCS atlas, we developed the unified AllCCS
database to store, standardize, and share experimental and
predicted CCS values. First, we collected 5119 reported experimental
CCS values for 2193 compounds from 14 datasets, four laboratories, and
two commercial IM–MS instruments (Supplementary Table [74]1). Then, we
developed a five-step standardization procedure to clean up and unify
all experimental CCS records, including collection of meta information,
quality check, outlier removal, calculation of unified CCS values, and
assignment of confidence levels (“Methods” and Supplementary
Fig. [75]1). As a result, a total of 3539 unified CCS values with
different adduct forms were calculated for 2193 compounds with
definitive confidence levels (Fig. [76]2a). AllCCS provides
wide coverage of experimental CCS values for small molecules. Unlike
other CCS databases, AllCCS is a platform that unifies different CCS
values and overcomes the variations among different instruments and
labs. The removal of outliers using a trend-line technique improved the
accuracy, and this approach has been validated in recent
publications^[77]30 (Supplementary Fig. [78]2). The unified CCS values are
divided into level 1, level 2, level 3, and conflict, with 462, 448, 2491,
and 138 values, respectively (Supplementary Table [79]2). For example,
3,5-Diiodothyronine has a unified CCS value of 195.5 Å^2 for [M + H]^+
with confidence level 1, because it has been reported twice on drift
tube IM-MS (DTIM-MS) from different labs and the maximum difference is
within 1% (Supplementary Fig. [80]3). Currently, the unified CCS values
comprise 2423 cations and 1116 anions, and cover nine and six
adducts in positive and negative modes, respectively (Fig. [81]2b and
Supplementary Table [82]3). In terms of chemical diversity, they
cover 15 super classes, 144 classes, and 257 subclasses according to
the definition of ClassyFire^[83]34 (Fig. [84]2c and Supplementary
Table [85]4). Among them, lipids and lipid-like molecules,
organoheterocyclic compounds, and benzenoids are the major super
classes. We also compared the structural diversity of compounds in
the experimental AllCCS database with the Human Metabolome Database
(HMDB) and DrugBank. The results showed that experimental CCS values
covered 51.3% and 78.4% of the chemical spaces of HMDB and DrugBank,
respectively (Supplementary Fig. [86]4). These results demonstrated that
compounds in AllCCS have highly diverse and representative chemical
structures. The unified CCS values are accessible on the AllCCS webserver
([87]http://allccs.zhulab.cn/).
Fig. 2. Unified AllCCS atlas for both experimental and predicted CCS values.
a Statistics of compounds and unified CCS values in AllCCS and other
databases^[90]12,[91]24,[92]26; b statistics of adduct ions for unified
CCS values in AllCCS; c chemical diversity for compounds with unified
CCS in AllCCS, which was analyzed using ClassyFire; d statistics of
compounds and predicted CCS values in AllCCS and other
databases^[93]27,[94]31; e, f correlations between predicted and
experimental CCS values for external validation sets 1 (e) and 2 (f); g
median relative errors (MREs) of predicted CCS values for 10 super
classes of chemical structures; h, i cumulative percentages of
predicted CCS values with indicated relative errors for external
validation sets 1 (h) and 2 (i). The inset bar plot displays the
percentages of predicted CCS values within a certain relative error
obtained from different tools; j heat map displaying the comparative
prediction errors of ten super classes of chemical structures obtained
from different tools; median relative error for each super class is
shown in the pane while the color of pane is MRE normalized as Z-score.
The symbol “*” represents that AllCCS has the lowest prediction error
among these tools. Source data are provided as a Source Data file.
CCS prediction and performance benchmark
In AllCCS, we further employed the new unified experimental CCS
database, and optimized our machine-learning algorithm to predict CCS
values of small molecules on a large scale. Compared with MetCCS,
AllCCS has several distinct features: (1) a large training dataset with
high diversity of chemical structures (1873 compounds in total;
Supplementary Data [95]1); (2) reduction of molecular descriptors to 15
and 9 for positive and negative modes, respectively; (3) development of
representative structure similarity (RSS) score to estimate prediction
accuracy. The details of the machine-learning-based prediction are
provided in “Methods”. AllCCS now includes a total of 1,670,596
compounds and 11,697,711 predicted CCS values, and covers seven popular
compound databases: KEGG^[96]35, HMDB^[97]36, LMSD^[98]37, MINE^[99]38,
DrugBank^[100]39, DSSTox^[101]40, and UNPD^[102]41. To the best of our
knowledge, AllCCS is the largest and most comprehensive CCS database
(Fig. [103]2d). All predicted CCS values were assigned confidence
level 4 and have been deployed on the AllCCS webserver. These compound
records can be easily retrieved with different identifiers, such as
SMILES, InChI, and InChIKey. AllCCS also allows users to predict CCS
values for new compounds by inputting SMILES structures.
We validated the performance of AllCCS using two independent and
external datasets, including validation set 1 (662 CCS values for
metabolites and lipids; Supplementary Data [104]2), and validation set
2 (229 CCS values for drugs and natural products; Supplementary
Data [105]3). Excellent agreement between experimental and
predicted CCS values was observed in both datasets (Fig. [106]2e, f
and Supplementary Table [107]5). Specifically, for metabolites and
lipids, the median relative errors (MREs) were 1.66% and 1.74%, while
R^2 were 0.9901 and 0.9850 in positive and negative modes, respectively
(Fig. [108]2e). For drugs and natural products, MREs were 1.81% and
2.25%, while R^2 were 0.9687 and 0.9230 in positive and negative modes,
respectively (Fig. [109]2f). These results demonstrated that AllCCS can
predict CCS values with low errors for both endogenous and exogenous
small molecules. Consistently, similar results were also obtained for
different chemical classes and types of ion adducts (Fig. [110]2g and
Supplementary Fig. [111]5). Taken together, AllCCS can accurately
predict CCS values for small molecules with a vast diversity of
chemical structures.
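As an illustration of the two summary statistics reported above, the following minimal Python sketch computes the median relative error and R^2 for paired experimental and predicted CCS values. The function names and the example values are hypothetical, and R^2 is taken here as the squared Pearson correlation coefficient, which is an assumption rather than a detail stated in this section.

```python
from statistics import mean, median


def relative_errors(experimental, predicted):
    """Absolute relative error (in %) of each predicted CCS value."""
    return [abs(p - e) / e * 100.0 for e, p in zip(experimental, predicted)]


def median_relative_error(experimental, predicted):
    """Median of the absolute relative errors (MRE, in %)."""
    return median(relative_errors(experimental, predicted))


def r_squared(experimental, predicted):
    """Squared Pearson correlation (assumed definition of R^2)."""
    mx, my = mean(experimental), mean(predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(experimental, predicted))
    sxx = sum((x - mx) ** 2 for x in experimental)
    syy = sum((y - my) ** 2 for y in predicted)
    return sxy ** 2 / (sxx * syy)


# Made-up CCS values in square angstroms, for illustration only.
experimental = [150.0, 200.0, 250.0]
predicted = [153.0, 198.0, 250.0]

mre = median_relative_error(experimental, predicted)
r2 = r_squared(experimental, predicted)
```

The median is used rather than the mean so that a few poorly predicted compounds do not dominate the summary.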
We also benchmarked the performance of AllCCS with MetCCS^[112]27,
DeepCCS^[113]30, and ISiCLE^[114]31 using validation sets
(Fig. [115]2h–j and Supplementary Data [116]4). The results revealed
that AllCCS substantially improved prediction accuracy.
Specifically, for metabolites and lipids, 84% of CCS values had a
relative error <4% with AllCCS, compared with only 70%, 62%, and 28%
with MetCCS, DeepCCS, and ISiCLE, respectively (Fig. [117]2h). Similar
results were also observed for drugs and natural products, for which 81%,
65%, 62%, and 35% of CCS values had a relative error <4% with AllCCS,
MetCCS, DeepCCS, and ISiCLE, respectively (Fig. [118]2i). In addition to
accuracy, AllCCS also has advantages in prediction coverage and
applicability. Specifically, AllCCS demonstrated the best prediction
accuracy for most chemical super classes (8 out of 10), such as
alkaloids and derivatives, benzenoids, and organic nitrogen compounds
(Fig. [119]2j). For super classes such as alkaloids and derivatives and
lipids and lipid-like molecules, only AllCCS predicted CCS
values with MRE <2%. Some examples are provided in Supplementary
Fig. [120]6. Finally, AllCCS offers further improvements
over other tools, including implementation, computation time, and
visualization (Supplementary Table [121]6). Collectively, these results
demonstrated that AllCCS outperforms other tools in calculating CCS
values in terms of accuracy and coverage.
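The benchmark above compares tools by the percentage of predictions falling below a relative-error threshold. A minimal sketch of that calculation is given below; the tool names and error lists are made up for illustration and are not the benchmark data.

```python
def percent_within(relative_errors_pct, threshold_pct=4.0):
    """Percentage of predictions whose relative error is below the threshold."""
    if not relative_errors_pct:
        raise ValueError("no relative errors supplied")
    hits = sum(1 for err in relative_errors_pct if err < threshold_pct)
    return 100.0 * hits / len(relative_errors_pct)


# Hypothetical relative errors (%) for two placeholder tools.
errors_by_tool = {
    "tool_a": [0.5, 1.2, 3.9, 6.0],
    "tool_b": [2.0, 4.5, 5.1, 8.3],
}

summary = {tool: percent_within(errs) for tool, errs in errors_by_tool.items()}
```

Evaluating `percent_within` over a range of thresholds gives the cumulative curves of the kind shown in Fig. 2h, i.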
Structural similarity and CCS prediction accuracy
We demonstrated that the structural similarity between the inputted
chemical structures and the training dataset determines the accuracy of
predicted CCS values. To validate this, we divided our training and
validation datasets into five super classes using ClassyFire in both
positive and negative modes. We intentionally removed the compounds of
one super class from the training dataset (for example, lipids and
lipid-like molecules), and built the machine-learning-based prediction
model using the remaining compounds as usual (Fig. [122]3a). Then, CCS
values in validation sets were predicted and divided into two types:
the excluded super class (i.e., lipids and lipid-like molecules) and
other super classes, and further compared with the results in
Fig. [123]2g. Taking lipids and lipid-like molecules as an example, the
predicted CCS values in the validation set showed significantly larger
errors after excluding lipids from the training set (Fig. [124]3b). We
also observed that other super classes had similar prediction errors
before and after excluding lipids from the training dataset.
Then, we repeated the process for each super class, and observed
similar results (Fig. [125]3c). The results demonstrated that the
prediction errors of the excluded super classes (MRE = 3.14%) were
significantly larger than those of included super classes (MRE = 1.63%,
Fig. [126]3d). Specifically, the prediction errors of CCS values in
excluded super classes had an average relative error (ARE) as high as
9.16%. As a comparison, the included super classes only had an ARE of
2.20%. Therefore, the results confirmed that the structural similarity
between the inputted chemical structures and the training dataset
determines the CCS prediction accuracy.
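The exclusion experiment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields, the function names, and the toy data are hypothetical, and the placeholder model (a least-squares line of CCS against m/z) merely stands in for the machine-learning model, whose details are given in “Methods”.

```python
from statistics import mean, median


def fit_mass_line(records):
    """Placeholder model: least-squares line of CCS against m/z."""
    xs = [r["mz"] for r in records]
    ys = [r["ccs"] for r in records]
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda r: slope * r["mz"] + intercept


def leave_one_class_out(training, validation, excluded_class, fit):
    """Train without one super class; report MRE (%) for excluded vs. included."""
    reduced = [r for r in training if r["super_class"] != excluded_class]
    predict = fit(reduced)
    errors = {"excluded": [], "included": []}
    for r in validation:
        group = "excluded" if r["super_class"] == excluded_class else "included"
        errors[group].append(abs(predict(r) - r["ccs"]) / r["ccs"] * 100.0)
    return {group: median(errs) for group, errs in errors.items() if errs}


# Toy records with made-up m/z and CCS values.
training = [
    {"super_class": "lipids", "mz": 700.0, "ccs": 280.0},
    {"super_class": "lipids", "mz": 800.0, "ccs": 300.0},
    {"super_class": "benzenoids", "mz": 100.0, "ccs": 120.0},
    {"super_class": "benzenoids", "mz": 200.0, "ccs": 140.0},
]
validation = [
    {"super_class": "lipids", "mz": 750.0, "ccs": 290.0},
    {"super_class": "benzenoids", "mz": 150.0, "ccs": 131.0},
]

result = leave_one_class_out(training, validation, "lipids", fit_mass_line)
```

Repeating the call once per super class and pooling the two error groups mirrors the comparison summarized in Fig. 3c, d.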
Fig. 3. Representative structure similarity (RSS) and CCS prediction
accuracy.
a The simulation workflow for investigating the structural similarity
and prediction accuracy of CCS values; b comparison of CCS prediction
accuracy before and after excluding lipids and lipid-like
molecules from the training set; the abbreviations “ARE” and “MRE”
represent average relative error and median relative error,
respectively; p-values determined by two-sided Wilcoxon rank-sum test;
c, d comparison of CCS prediction accuracy before and after
excluding one super class from the training set; p-value determined by
two-sided Wilcoxon rank-sum test; e correlation between the RSS scores
and the relative errors of CCS prediction; the error bands represent
95% confidence interval; p-value determined by linear regression; f
correlation of RSS scores and prediction errors in validation sets; the
inset bar plot displays MREs for different RSS groups; the error bands
represent 95% confidence interval; p-value determined by linear
regression; g two compound examples for RSS and CCS prediction
accuracy; the abbreviation “TC” represents Tanimoto coefficient; h
distribution of RSS scores in seven common compound databases. Source
data are provided as a Source Data file.
Representative structure similarity for accuracy evaluation
In AllCCS, we further developed the representative structure similarity
(RSS) score to quantify the structural similarity between the given
compound and the training dataset, and to evaluate its relationship
with the accuracy of predicted CCS values. RSS was calculated using
molecular fingerprinting, and ranges from 0 to 1, representing
completely different to highly similar structures compared to the
training dataset (see “Methods”). Then, we found that there was a
strong and significant correlation between RSS scores and the relative
errors of predicted CCS values generated in the designed experiment in
Fig. [129]3a (p-value = 3.19 × 10^−42; Fig. [130]3e). We empirically
divided the RSS scores into small (RSS ≤ 0.6), medium
(0.6 < RSS ≤ 0.8), and large (RSS > 0.8) groups. The large RSS group had
significantly smaller MREs in CCS prediction than the small RSS group
(1.56% vs. 5.90% in Fig. [131]3e). In AllCCS, we further validated the
use of the RSS score to evaluate CCS prediction accuracy in validation
datasets (Fig. [132]3f). Compounds with small RSS had larger MRE (2.6%)
than those with large RSS (1.7%). For example, the CCS value of
cardamonin was accurately predicted with a relative error of 0.4% since it
has a high RSS score of 0.9363. In comparison, the CCS value of
triclocarban had a large prediction error of 14.9% since it has a low
RSS score of 0.5645 (Fig. [133]3g). Finally, we also calculated the RSS
scores for compounds in common databases (KEGG, HMDB, LMSD, MINE,
DrugBank, DSSTox, and UNPD), and found that >80% of compounds have large
or medium RSS scores, for which AllCCS is expected to generate accurate
CCS values (<2% error; Fig. [134]3h). Altogether, we showed that RSS
implemented in AllCCS is able to evaluate and reflect the accuracy of CCS
prediction, and AllCCS has a wide applicability for different small
molecules.
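The RSS grouping above can be illustrated with a short sketch. The exact RSS formula is given in “Methods” and is not reproduced here; the code below only shows the standard Tanimoto coefficient between two fingerprints (represented, as an assumption, by sets of on-bit indices) and the empirical small/medium/large binning stated in the text. The function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def rss_group(rss):
    """Bin an RSS score using the thresholds stated in the text."""
    if not 0.0 <= rss <= 1.0:
        raise ValueError("RSS is expected to lie between 0 and 1")
    if rss <= 0.6:
        return "small"
    if rss <= 0.8:
        return "medium"
    return "large"


example_tc = tanimoto({1, 2, 3}, {2, 3, 4})   # 2 shared bits out of 4 in total
cardamonin_group = rss_group(0.9363)          # RSS value reported in the text
triclocarban_group = rss_group(0.5645)        # RSS value reported in the text
```

A compound falling in the small group would be flagged as likely to carry a larger prediction error, as in the triclocarban example.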
AllCCS improves known metabolite annotations
With the large-scale AllCCS atlas, we first investigated its
application for annotation of known compounds using validation set
2 as an example. We matched the experimental m/z and CCS values of each
compound to the whole AllCCS database (with ~1.67 million compounds).
The average number of candidates was significantly reduced from 1046 to
255 (76%) with the addition of CCS match (Fig. [135]4a and Supplementary
Data [136]5). Similar results were obtained when performing
multi-dimensional match using experimental m/z, as well as CCS values
and MS/MS spectra predicted by AllCCS and CFM-ID, respectively. The
average number of candidates was also significantly reduced from 553 to
144 (74%) with the addition of CCS match (Fig. [137]4a). We also
demonstrated this using the experimental MS/MS spectral library from
GNPS^[138]42 with a total of 13,499 compounds. Similarly, the average
number of candidates was reduced from 7.3 to 1.7 (77%) with the addition
of CCS match (Supplementary Fig. [139]7). Therefore, ~75% of annotated
candidates were filtered out with the addition of CCS match. In addition,
the candidate reduction through adding CCS match is effective with
different database scales (Supplementary Fig. [140]8).
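The candidate reduction described above can be sketched as a two-stage filter: candidates are first matched on m/z and then on CCS. In this minimal example the 4% CCS tolerance follows the value given in the Fig. 4 caption, whereas the 10 ppm m/z tolerance, the record fields, and the candidate entries are assumptions made only for illustration.

```python
def match_candidates(mz, ccs, candidates, mz_tol_ppm=10.0, ccs_tol_pct=4.0):
    """Return (m/z-matched candidates, candidates matching both m/z and CCS)."""
    mz_hits = [c for c in candidates
               if abs(mz - c["mz"]) / c["mz"] * 1e6 <= mz_tol_ppm]
    both_hits = [c for c in mz_hits
                 if abs(ccs - c["ccs"]) / c["ccs"] * 100.0 <= ccs_tol_pct]
    return mz_hits, both_hits


# Hypothetical library entries (made-up m/z and CCS values).
library = [
    {"name": "cand_A", "mz": 163.0390, "ccs": 131.0},
    {"name": "cand_B", "mz": 163.0395, "ccs": 150.0},
    {"name": "cand_C", "mz": 163.0600, "ccs": 130.5},
]

mz_hits, both_hits = match_candidates(163.0390, 130.0, library)
reduction_pct = 100.0 * (len(mz_hits) - len(both_hits)) / len(mz_hits)
```

Here cand_C fails the m/z match and cand_B passes it but fails the CCS match, so the candidate list shrinks from two to one.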
Fig. 4. AllCCS improves known metabolite annotation.
a Candidate reduction with addition of CCS match demonstrated using
validation set 2; b percentages of rank improvement for correct
candidates with addition of CCS match to multi-dimensional match; three
in-silico MS/MS prediction tools (CFM-ID, MetFrag, and MS-FINDER) were
used to generate predicted MS/MS spectra; c an example of rank
improvement for the compound 6-Hydroxycoumarin with addition of CCS
match in multi-dimensional match; d schematic illustration of
multi-dimensional annotation of metabolites in biological samples; e
candidate reduction with addition of CCS match in metabolite annotation
of MEF cell samples; f the candidate reductions in different biological
samples; the CCS match tolerance was set as 4%; g percentages for
candidate reduction with multi-dimensional match for each feature in
MEF cell samples; h the hypoxanthine was successfully annotated with
multi-dimensional properties; MS-FINDER was used to generate predicted
MS/MS spectra for metabolite candidates. Source data are provided as a
Source Data file.
We also demonstrated that the addition of CCS match into the
multi-dimensional match improved the rank of correct candidates
(m/z + MS/MS + CCS vs. m/z + MS/MS matches; Fig. [143]4b). We found
that the addition of CCS match improved the rank for most correct
candidates (81.2–88.8%) when different in-silico MS/MS tools (i.e.,
CFM-ID^[144]43, MetFrag^[145]44, and MS-FINDER^[146]45) were used
(Fig. [147]4b). Similar results were also obtained in negative
ionization mode (Supplementary Fig. [148]9). Taking the annotation of
6-hydroxycoumarin as an example, with the addition of CCS match, the
number of potential candidates decreased from 956 to 181, and the rank
of the correct candidate improved from 129th to 6th (Fig. [149]4c).
Other annotation improvement examples are also provided in
Supplementary Fig. [150]10.
Next, we demonstrated the annotation of known metabolites in biological
samples with multi-dimensional match using m/z, CCS, and MS/MS spectra
(Fig. [151]4d). Here, the KEGG and HMDB databases were used. For the mouse
embryonic fibroblast (MEF) cell sample, we putatively annotated a total
of 2729 peaks from the acquired multi-dimensional liquid
chromatography–ion mobility–mass spectrometry (LC–IM–MS/MS) data with
multi-dimensional properties (Supplementary Fig. [152]11). Similarly,
we validated that the average number of candidates for these peaks was
effectively reduced from 9.2 to 4.5 with the addition of CCS match
(Fig. [153]4e, f). Similar results were also obtained for other
biological samples, such as human plasma and fruit fly tissues
(Fig. [154]4f and Supplementary Fig. [155]12). We found that 2038 out
of 2729 peaks (75%) had reduced candidates to different degrees
(Fig. [156]4g). Taking the feature M137T285C127 as an example, the
annotated candidates were reduced step-wise from 9 to 2 with the
integration of multi-dimensional match, and the feature was finally
annotated as hypoxanthine (1st rank; Fig. [157]4h). More examples are
provided in Supplementary Fig. [158]13. In addition, we found that the
percentage of candidate reduction ranged from 40% to 76% from high- to
low-abundance features, indicating a greater improvement for
low-abundance features (Supplementary Fig. [159]14). Combined, the
integration of multi-dimensional properties, especially the CCS match,
improved the annotation confidence with reduced false candidates and
improved ranks.
AllCCS enables unknown metabolite annotation
Unknown metabolites are generated from uncharacterized sources, such
as enzymatic transformation of endogenous metabolites,
biotransformation of exogenous compounds (e.g., from the environment), and
gut microbiota. Here, we further investigated the integration of
multi-dimensional properties (i.e., m/z + CCS + MS/MS) for unknown
metabolite annotation in biological samples (Fig. [160]5a). First, we
generated the possible unknown metabolites from knowns in KEGG. We used
all 16,023 metabolites in KEGG, and a total of 178 metabolic reactions
and 117 enzymes to perform the in-silico enzymatic reactions^[161]46
(Supplementary Data [162]6). Through this procedure, we created a
total of 100,404 possible unknown compounds via two-step in-silico
enzymatic reactions, and expanded the chemical space of KEGG
sevenfold (Fig. [163]5b). Among them, there are 5704 known unknowns
(5.7%), which are included in PubChem but not in KEGG, and 94,700
unknown unknowns (94.3%), which are included in neither PubChem
nor KEGG (Supplementary Fig. [164]15). For example, the in-silico
enzymatic reaction of phosphoenolpyruvate generated three unknown
metabolites (Fig. [165]5c). For all generated unknown metabolites, we
calculated the m/z values and predicted the CCS values using AllCCS, and
predicted the MS/MS spectra using MS-FINDER. This extended unknown database
makes unknown metabolite annotation possible. With the
multi-dimensional match to this database, we annotated both known and
unknown metabolites in mouse liver tissue samples. A total of 1223
features with 6092 metabolites were putatively annotated, including
2275 KEGG metabolites and 3817 unknowns (Fig. [166]5d and Supplementary
Data [167]8). Among them, 67.0% of features had an unknown annotation.
For example, the feature M384T767C189 (m/z: 384.1129; RT: 767s; CCS:
181.9 Å^2) had 22 possible candidates in the extended database
(Fig. [168]5e). Among them, 4 and 21 candidates were further reduced
with multi-dimensional match. Finally, an unknown metabolite
(ExtDB016054) was annotated and further confirmed with the chemical
standard (Fig. [169]5f and Supplementary Fig. [170]16).
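The distinction between known unknowns and unknown unknowns drawn above amounts to a database-membership check. The sketch below assumes each generated structure carries a structure key that can be looked up in KEGG and PubChem; the keys and the function name are hypothetical placeholders, and the lookup mechanism itself is not specified in this section.

```python
from collections import Counter


def classify_generated(compound_key, kegg_keys, pubchem_keys):
    """Label one in-silico reaction product by database membership."""
    if compound_key in kegg_keys:
        return "known"            # already a KEGG metabolite
    if compound_key in pubchem_keys:
        return "known unknown"    # in PubChem but not in KEGG
    return "unknown unknown"      # in neither database


# Placeholder structure keys, not real identifiers.
kegg_keys = {"KEY-A"}
pubchem_keys = {"KEY-A", "KEY-B"}
generated = ["KEY-A", "KEY-B", "KEY-C", "KEY-D"]

counts = Counter(classify_generated(k, kegg_keys, pubchem_keys)
                 for k in generated)
```

With the counts reported above, 5704 of 100,404 generated compounds corresponds to 5.7% known unknowns, and 94,700 of 100,404 corresponds to 94.3% unknown unknowns.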
Fig. 5. AllCCS enables unknown metabolite annotation.
a Schematic illustration of generation of possible unknown metabolites
from KEGG and prediction of CCS and MS/MS spectra to support
multi-dimensional match-based annotation; b number of unknown
metabolites; c generation of unknown metabolites from
phosphoenolpyruvate; d annotation of known and unknown metabolites in
mouse liver with the multi-dimensional match; the inner pie plot is the
composition of known and unknown metabolites; e an unknown metabolite
annotation for the feature M384T767C189; f validation of unknown
metabolite using chemical standard. Source data are provided as a
Source Data file.
We further investigated how AllCCS facilitated characterization of
metabolic activities through unknown metabolite annotation. First, we
analyzed 368 dysregulated metabolic features associated with aging in
mice (p-value ≤ 0.05; 36-week vs. 104-week; Fig. [173]6a), and
performed pathway enrichment analysis using KEGG metabolites. Five
metabolic pathways were enriched (p-value ≤ 0.05; Supplementary
Fig. [174]17) and showed declined activities with aging (left panel in
Fig. [175]6b), such as purine metabolism, nicotinate and nicotinamide
metabolism, and phenylalanine metabolism. The observations were consistent
with previous reports^[176]10. Second, we extended the analysis from
known KEGG metabolites to their related unknowns. Clearly, the unknown
metabolites generated from five enriched metabolic pathways also had
declined activities with aging (right panel in Fig. [177]6b). For
example, adenylosuccinic acid and its derived unknown (ExtDB016054)
were decreased during mouse aging (Fig. [178]6c). Interestingly,
unknown metabolite ExtDB016054 showed more significant changes compared
with its reactant precursor. Finally, we also performed chemical
structure enrichment analysis for unknown metabolites, and 14 out of
174 subclasses were enriched (Supplementary Fig. [179]17). Among them,
we found that several chemical subclasses related to purine metabolism
showed declined activities with aging, such as pyrimidines and
pyrimidine derivatives, and hydropyridines (Fig. [180]6d, e).
Altogether, AllCCS facilitated expanding the metabolite coverage of
annotation and extending the assessment of metabolic pathways and
activities by providing new chemical structures.
Fig. 6. Expansion of metabolite coverage and assessment of metabolic
activities with unknown annotation.
a Volcano plot showing the dysregulated features in aging mice (36-week
vs. 104-week; p-value ≤ 0.05; two-sided Student’s t-test); b heat maps
showing age-dependent activities of five enriched KEGG metabolic
pathways (left panel) and their related unknowns (right panel); c
examples of adenylosuccinic acid and its derived unknown (ExtDB016054)
down-regulated in aging mice (n = 10, biologically independent samples
for each group; two-sided Student’s t-test); d heat map showing
age-dependent activities of 14 enriched chemical subclasses; e
annotated unknowns from the subclass of pyrimidines and pyrimidine
derivatives down-regulated in aging mice (n = 10, biologically
independent samples for each group; two-sided Student’s t-test). The
lower, middle, and upper lines in box plots (c, e) correspond to the 25th,
50th, and 75th percentiles, and the whiskers extend to the most extreme
data point within 1.5 × the interquartile range (IQR). Source data are
provided as a Source Data file.
Discussion
In this work, we developed a large-scale ion mobility CCS atlas,
namely, AllCCS, to support metabolite annotation in IM–MS-based
metabolomics. Although several experimental CCS databases have been
developed^[183]22,[184]24,[185]26,[186]47, AllCCS is unique because it
provides a unified platform to store, standardize and share
experimental CCS values from different labs and instruments. One
possible concern is the inconsistency across different IM–MS
instruments. Recent work from the Schmitz group observed that most
compounds have <1% errors between traveling wave IMS (TWIMS) and DTIMS,
but some compounds showed larger deviations of up to 6.2%^[187]48.
Therefore, in AllCCS, we developed a five-step standardization
procedure to automatically clean up and unify the experimental CCS
records, which helps overcome the variations between different
instruments and labs. We believe the consistency of reported CCS values
will be further improved with the launch of guidelines from the IM–MS
research community^[188]17. In addition, any user can access the AllCCS
webserver to view, calculate, and download both experimental and
predicted CCS values in AllCCS. We are also working on deploying CCS
values into HMDB and other popular databases to make them available to
the wider community. With the rapid growth of reported CCS values,
AllCCS will be continually expanded and updated, making it a
valuable resource for both IM–MS and metabolomics research.
Different strategies have been developed for CCS calculation, such as
MetCCS^[189]27, LipidCCS^[190]29, DeepCCS^[191]30, ISiCLE^[192]31,
DarkChem^[193]49, and CCSbase^[194]50. However, the efficiency,
accuracy and generalization capability of these methods need further
improvement. Here, we demonstrated that AllCCS outperforms other tools
in terms of efficiency, accuracy and coverage. The improvements in CCS
prediction are attributed to three major factors (Supplementary
Table [195]7): (1) AllCCS used large and wide-coverage CCS records to
train the machine-learning-based prediction model; (2) the use of a
five-step standardization strategy and unified CCS values overcame the
biases across different instruments and labs and improved the data
quality of the training set; (3) the selection of optimized molecular
descriptors (MDs) improved the prediction accuracy. In addition, AllCCS
includes the representative structure similarity (RSS) score to
estimate the prediction accuracy for a given compound. However, some
limitations in CCS prediction remain. For example, most
metabolite ions have a unique CCS value, but some may have
multiple CCS values for different conformations. Currently, AllCCS and
other tools can only predict one CCS value for one conformation,
presumably the most compact one. The introduction of quantum chemistry
for conformation generation and 3-D molecular descriptors into the
machine-learning-based prediction may help to address this challenge.
The second challenge is the CCS prediction of isomers (e.g., cis-trans
isomers in lipids). Although CCS prediction errors have been reduced
to ~2%, identification of metabolite
isomers is still a challenge due to the limited resolution of ion
mobility separation (e.g., 40–60 for DTIMS and TWIMS). We demonstrated
this challenge with an example of four monosaccharide phosphate
isomers, which were poorly separated with IM (Supplementary
Fig. [196]18). However, these isomers would be partially separated with
an IM resolution of 200, and baseline separated with an IM resolution
of 500. Therefore, both CCS prediction and annotation accuracy will be
further improved with the availability of high-resolution IM
instruments, such as TIMS and cyclic IM. Finally, although we focus on
metabolites, AllCCS also supports the CCS prediction for other small
molecules, such as drugs, natural products, and pesticides.
Metabolite annotation is one of the major bottlenecks for untargeted
metabolomics. Metabolite annotation usually requires high-quality
experimental MS/MS spectra. However, many experimental factors, such as
high sample complexity, low concentration and co-elution of isobaric
metabolites, make it challenging to acquire high-quality MS/MS spectra.
In contrast, the measurement of CCS values is less affected by
experimental factors, and CCS values can be accurately acquired from
molecular ions even at low abundances. The use of multi-dimensional match,
including CCS match, provides multi-dimensional characterization of
metabolites. For example, ~75% of low-abundance features have reduced
candidate numbers with the addition of CCS match (Supplementary
Fig. [197]14). We demonstrated that the addition of CCS values to the
multi-dimensional match improved the annotation confidence with reduced
false candidates and improved ranks of correct candidates. Currently,
AllCCS does not directly process raw IM–MS data files. Instead, we aim
to incorporate AllCCS into other data processing tools (e.g.,
MS-DIAL4^[198]51) to accelerate the workflow from raw data processing
to compound identification. In addition, developers could also
integrate AllCCS with other data processing software tools such as
XCMS^[199]52 and MZmine^[200]53.
Unknown metabolites are generated from endogenous and exogenous
sources, and have no standard MS/MS spectra available. Currently,
unknown metabolite annotation is mainly performed using in-silico MS/MS
tools (e.g., MS-FINDER^[201]45 and SIRIUS^[202]54). In this work, we
further demonstrated that AllCCS provided a promising strategy to
integrate the predicted CCS and in-silico MS/MS tools for unknown
annotation, and facilitated extending the assessment of metabolic
pathways and activities with new chemical structures. These unknown
metabolites were created from KEGG compounds via in-silico enzymatic
reactions, because KEGG covers ~6000 species and various compounds,
including primary metabolites, secondary metabolites from plants and
bacteria, and xenobiotic compounds^[203]55. In the future, other
databases such as RECON^[204]56 would also be good choices for metabolic
reconstruction in studies of human metabolism. Finally, we believe this
strategy will become more powerful when combined with other
data processing tools (e.g., MS-DIAL4) and advanced algorithms (e.g.,
GNPS^[205]9 and MetDNA^[206]10). Taken together, the AllCCS atlas has
provided a high-quality and unified CCS database for IM–MS, and further
opens a new avenue for known and unknown metabolite annotation in
IM–MS-based untargeted metabolomics.
Methods
Curation of the unified CCS database
A total of 5119 experimental CCS values were collected from 14
datasets, 4 independent labs, and 2 instrument platforms (Supplementary
Table [207]1), which were reported in publications from
2015 to 2018. To curate the unified CCS database, each dataset was cleaned
and standardized with a five-step procedure as follows (Supplementary
Fig. [208]1).
1. Collection of meta information. For each CCS record, the chemical
translation service^[209]57 [[210]http://cts.fiehnlab.ucdavis.edu/]
was utilized to generate the chemical identifiers for compounds,
such as InChIKey, CAS number, PubChem CID, etc. Then, the SMILES
structure for each compound was generated using an R package rinchi
[[211]https://github.com/CDK-R/rinchi]. The compound formula and
exact masses for different adducts (Supplementary Table [212]8)
were calculated using an R package rcdk
[[213]https://cran.r-project.org/web/packages/rcdk/index.html].
Finally, the chemical classification for each compound was obtained
using ClassyFire^[214]34 [[215]http://classyfire.wishartlab.com/].
2. Quality check. CCS records were removed if they lacked
chemical structures, had ion adducts not included in
Supplementary Table [216]8, or had large m/z errors (>10 ppm).
Then, for each dataset, we also removed the inconsistent CCS
records from the same instrument platform. For one ion adduct with
more than one CCS record, the maximum differences between CCS
records were calculated. If the maximum difference was >0.5%, the
related CCS records were removed. Otherwise, the averaged CCS value
was calculated and assigned as the CCS record.
3. Outlier removal. The CCS outliers were further removed using
CCS trend lines, similar to the approach of the CCS compendium^[217]26.
The trend line of each super class (n ≥ 10) was fitted by a power
function, and the CCS records falling outside the 99% prediction
interval were removed. A total of 103 CCS outliers were removed,
and two examples of outliers were confirmed in a recent
publication^[218]30 (Supplementary Fig. [219]2).
4. Calculation of unified CCS values. The CCS values from different
instrument platforms were further merged as unified CCS values. The
unified CCS value is an average of CCS values from different
instrument platforms, which is specific to the compound and its
adduct. Specifically, for one ion adduct, if it had multiple CCS
records obtained from DTIM-MS, the unified CCS value was the
average value from the CCS records in DTIM-MS. Otherwise, the
unified CCS value was calculated using all CCS records from
different platforms. A total of 3539 unified CCS values were
generated.
5. Assignment of confidence levels. For each unified CCS value, we
assigned a confidence level using the following rules: Level 1: the
unified CCS is calculated using experimental CCS records from ≥2
independent datasets in DTIM-MS instruments, and the maximum CCS
difference is ≤1%; Level 2: the unified CCS is calculated using
experimental CCS records from ≥2 independent datasets in different
commercial instruments (DTIM-MS, TWIM-MS, or TIMS-MS), and the
maximum CCS difference is ≤3%; Level 3: the unified CCS is only
reported in one dataset from commercial instruments (DTIM-MS,
TWIM-MS, or TIMS-MS); Conflict: the unified CCS is calculated using
experimental CCS records from ≥2 independent datasets in different
commercial instruments (DTIM-MS, TWIM-MS, or TIMS-MS), but the
maximum CCS difference is >3%. All predicted CCS values were
assigned as level 4 in AllCCS.
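The consistency check (step 2), the unified CCS calculation (step 4), and the confidence-level rules (step 5) can be sketched as follows. This is only an illustration under stated assumptions, not the AllCCS implementation: the "maximum difference" is assumed to be the max-min spread expressed as a percentage of the mean, the record layout `(platform, dataset_id, ccs)` and all function names are hypothetical, and the order in which the level rules are tested (Level 1 on DTIM-MS records first, then Level 2 or Conflict on all records) is one possible reading of the rules above.

```python
from statistics import mean

def max_rel_diff(values):
    """Spread of CCS records as a percentage of their mean (assumed definition)."""
    return (max(values) - min(values)) / mean(values) * 100.0

def merge_replicates(values, limit_pct=0.5):
    """Step 2: average same-platform records, or return None if they differ by >0.5%."""
    if max_rel_diff(values) > limit_pct:
        return None
    return mean(values)

def unified_ccs(records):
    """Step 4: `records` holds (platform, dataset_id, ccs) tuples for one ion adduct."""
    dtim = [ccs for platform, _, ccs in records if platform == "DTIM-MS"]
    # Prefer the DTIM-MS average when several DTIM-MS records exist.
    chosen = dtim if len(dtim) > 1 else [ccs for _, _, ccs in records]
    return mean(chosen)

def confidence_level(records):
    """Step 5, using the rule ordering assumed in the lead-in."""
    if len({dataset for _, dataset, _ in records}) < 2:
        return "Level 3"
    dtim = [(d, ccs) for platform, d, ccs in records if platform == "DTIM-MS"]
    if len({d for d, _ in dtim}) >= 2 and max_rel_diff([c for _, c in dtim]) <= 1.0:
        return "Level 1"
    spread = max_rel_diff([ccs for _, _, ccs in records])
    return "Level 2" if spread <= 3.0 else "Conflict"
```

For example, two DTIM-MS records of 200.0 and 201.0 Å² from independent datasets differ by about 0.5% under this definition, so they would be averaged to 200.5 Å² and labelled Level 1.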
Training and validation sets for CCS prediction
AllCCS employed the unified CCS values for CCS prediction and
validation. Specifically, 80% of unified CCS values (1851 and 795 CCS
values in positive and negative modes, respectively) were randomly
selected as the training set (Supplementary Data [220]1). Here, we only
kept the seven most common adducts ([M + H]^+, [M + Na]^+, [M + NH[4]]^+
and [M + H-H[2]O]^+ for positive mode; [M-H]^-, [M + Na-2H]^-,
[M + HCOO]^- for negative mode), and removed CCS values with the
confidence level of conflict. In addition, two datasets were used for
performance validation: (1) external validation set 1 (metabolites and
lipids) consists of 463 and 199 CCS values in positive and negative
modes, respectively (Supplementary Data [221]2); (2) external
validation set 2 (drugs and natural products) consists of 107 and 122
CCS values in positive and negative modes, respectively (Supplementary
Data [222]3). Both validation sets were acquired using chemical
standards on an Agilent DTIM-MS 6560. The acquisition of CCS values and
the standard MS/MS spectra followed our previous publication^[223]27.
Molecular descriptor calculation and selection
For each compound, a total of 221 molecular descriptors (MDs) were
calculated using the SMILES structure and the R package rcdk. Among
them, non-differential MDs were first removed. The missing values for
the remaining MDs were imputed using the KNN algorithm. All MD values were
normalized to Z-scores and subjected to selection using the recursive
feature elimination with cross validation (RFECV) algorithm
(Supplementary Fig. [224]19). In order to eliminate the effect of
training set size, 50%, 60%, 70%, 80%, or 90% of the training set was
used for RFECV. For each condition, RFECV was performed 200
times (1000 times in total). In each RFECV, the least important MD was
recursively removed according to the coefficient of the LASSO
regression via a tenfold cross validation. The MD combination with
the highest score in the cross validation was kept. Finally, MDs selected
in >700 of the 1000 RFECV replications were retained. In
positive and negative modes, 15 and 9 MDs were selected, respectively
(Supplementary Table [225]9). We also demonstrated that the selected
MDs showed smaller prediction errors than those obtained from the
step-wise selection or the random selection (Supplementary Fig. [226]20
and Supplementary Table [227]10). The Python package scikit-learn
[[228]https://scikit-learn.org/stable/] was used for RFECV.
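The final aggregation step, keeping MDs selected in more than 700 of the 1000 RFECV replications, can be illustrated with a short standard-library sketch. The RFECV runs themselves (performed with scikit-learn in this work) are not reproduced here; the function names, the descriptor names in the usage note, and the use of the population standard deviation for the Z-score are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev

def z_score(values):
    """Normalize one descriptor column to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("constant descriptor: remove it before normalization")
    return [(v - mu) / sigma for v in values]

def stable_descriptors(runs, min_count=700):
    """Return descriptors selected in more than `min_count` replicates.

    `runs` is an iterable with one collection of selected descriptor
    names per RFECV replicate.
    """
    counts = Counter(name for run in runs for name in set(run))
    return sorted(name for name, n in counts.items() if n > min_count)
```

With hypothetical descriptor names, `stable_descriptors([{"apol", "XLogP"}] * 800 + [{"apol"}] * 200)` keeps both descriptors, whereas a descriptor selected in exactly 700 replicates would be dropped because the threshold is strict.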
Support vector regression-based CCS prediction
The support vector regression (SVR) algorithm was used to develop the
CCS prediction model using the selected MDs and CCS values in the
training set. The general workflow was similar to that of our previous
publication^[229]29. Briefly, two hyper-parameters, the cost of constraints
violation (C) and gamma (γ), were optimized from 105 combinations via a
tenfold cross validation with 100 repeats. Seven C values
(0.001, 0.005, 0.025, 0.05, 0.1, 0.25, 0.5)/N[MD] and 15 γ values
(2 to 2^15) were set for parameter optimization. The radial basis function
was employed as the kernel function. N[MD] represents the number of
selected MDs. Finally, the hyper-parameter combinations were selected
as follows: C, 0.1/15 and 0.1/9 in positive and negative modes,
respectively; γ, 2^8 and 2^13 in positive and negative modes,
respectively. As a result, MREs of 1.67% and 1.72% were obtained for
the training set in positive and negative modes, respectively
(Supplementary Table [230]11). In addition, the high gamma parameters
indicated that the optimized parameters in SVR prediction push the
model towards a linear regression, although it performs better
than multiple linear regression (Supplementary Table [231]12).
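The size of the search grid follows directly from the stated ranges: seven C values scaled by N[MD] and 15 γ values give 7 × 15 = 105 combinations. A minimal sketch is shown below, assuming that "2 to 2^15" means the powers 2^1, 2^2, ..., 2^15; the constant and function names are hypothetical.

```python
from itertools import product

# Numerators of the seven C values; each is divided by the number of selected MDs.
C_NUMERATORS = (0.001, 0.005, 0.025, 0.05, 0.1, 0.25, 0.5)
# Fifteen gamma values, assumed to be 2^1 ... 2^15.
GAMMAS = tuple(2 ** k for k in range(1, 16))

def svr_grid(n_md):
    """Return all (C, gamma) pairs for a model built on `n_md` descriptors."""
    return [(c / n_md, gamma) for c, gamma in product(C_NUMERATORS, GAMMAS)]
```

`len(svr_grid(15))` is 105, and the reported optima, (0.1/15, 2^8) for positive mode and (0.1/9, 2^13) for negative mode, are both members of the corresponding grids. Each pair could then be scored by repeated tenfold cross validation of an RBF-kernel SVR.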
Representative structure similarity
The representative structure similarity (RSS) was calculated to
characterize the structure similarity between the inputted structure
and the training set (Supplementary Fig. [232]21). The molecular
fingerprint of inputted structure was first computed using the R
package rcdk. Then, the structure similarity between the inputted
structure and each structure in the training set was calculated using
the Tanimoto coefficient (TC) shown as follows:
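$$\mathrm{TC}_{(\mathrm{StrA},\mathrm{StrB})} = \frac{N_{\mathrm{StrA}\cap\mathrm{StrB}}}{N_{\mathrm{StrA}} + N_{\mathrm{StrB}} - N_{\mathrm{StrA}\cap\mathrm{StrB}}} \tag{1}$$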
where N[StrA] and N[StrB] were the molecule fingerprints of structures
A and B, respectively, and TC[(StrA,StrB)] was the TC between structure
A and structure B. Here, structure A was the inputted structure and
structure B was a structure in training set. N[StrA∩StrB] was the
intersection set of structure A and B. Then, RSS score of the inputted
structure was calculated using the average of top five TCs:
$$\mathrm{RSS}_{\mathrm{StrA}} = \sum_{i=1}^{5} \mathrm{TC}_{i} / 5 \tag{2}$$
where RSS[StrA] was the RSS of the inputted structure A, and TC[i]
represented the i-th highest Tanimoto coefficient.
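Eqs. 1 and 2 can be illustrated with a short sketch in which each fingerprint is represented as the set of its "on" bit positions. This representation and the function names are assumptions made for illustration; in this work the fingerprints were computed with the R package rcdk. When fewer than five training structures are supplied, the sketch averages over those available, which is a choice of the example rather than part of Eq. 2.

```python
def tanimoto(fp_a, fp_b):
    """Eq. 1: shared bits divided by the bits present in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0

def rss(query_fp, training_fps, top_n=5):
    """Eq. 2: mean of the `top_n` highest Tanimoto coefficients to the training set."""
    scores = sorted((tanimoto(query_fp, fp) for fp in training_fps), reverse=True)
    top = scores[:top_n]
    if not top:
        raise ValueError("training set is empty")
    return sum(top) / len(top)
```

For example, `tanimoto({1, 2, 3}, {2, 3, 4})` gives 2 / (3 + 3 − 2) = 0.5. An RSS close to 1 means the query structure has several close analogues in the training set.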
Benchmark of CCS prediction performance
The generation of CCS values using MetCCS^[233]27, DeepCCS^[234]30, and
ISiCLE^[235]31 for compounds in the external validation sets was
performed as follows. For MetCCS, the webserver
[[236]http://www.zhulab.cn/MetCCS/] was used to predict CCS values. The
inputted molecular descriptors of each compound were calculated by
ChemAxon MarvinSketch (Version 16.10.24) and ALOGPS
[[237]http://www.vcclab.org/web/alogps/]. For DeepCCS, CCS values were
calculated using the SMILES structures and the Python package
downloaded from [238]https://github.com/plpla/DeepCCS on
April 2nd, 2019. For ISiCLE, CCS values generated from ISiCLE Lite
v0.1.0 were directly downloaded from the webserver
[[239]https://metabolomics.pnnl.gov/ccs/] on March 11th, 2019. All CCS
values were provided in Supplementary Data [240]4.
AllCCS webserver
The AllCCS webserver is hosted on a Linux server from Alibaba Cloud,
and is freely accessible for non-commercial use via
[241]http://allccs.zhulab.cn/. The AllCCS webserver has three major
functions: (1) the unified and predicted CCS databases, (2) the CCS
prediction, and (3) metabolite annotation. The predicted AllCCS
database contains a total of 1,670,596 compounds and 11,697,711
predicted CCS values. These compounds were downloaded from KEGG^[242]35,
HMDB^[243]36, LMSD^[244]37, MINE^[245]38, DrugBank^[246]39,
DSSTox^[247]40, and UNPD^[248]41 databases (Supplementary
Table [249]13). The CCS prediction function provides a visualized
interface for users to predict CCS values with the inputted SMILES
structures. The metabolite annotation provides a feature match function
to search the AllCCS database with experimental m/z and CCS values. In
addition, it also provides a candidate rank function to perform
multi-dimensional annotation by integrating the annotation results from
in-silico MS/MS prediction tools. The tutorial of AllCCS is available
on the website.
CCS match, MS/MS match, and multi-dimensional match
A trapezoidal score function was developed to measure the CCS match.
First, it removed the candidates with CCS values exceeding the maximum
tolerance, then calculated the CCS match score (S[ccs]) using a
trapezoidal function as Eq. [250]3:
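$$S_{\mathrm{ccs}} = \begin{cases} 1, & \Delta_{\mathrm{rela}} < \mathrm{TOL}_{\mathrm{min}} \\ 1 - \dfrac{\Delta_{\mathrm{rela}} - \mathrm{TOL}_{\mathrm{min}}}{\mathrm{TOL}_{\mathrm{max}} - \mathrm{TOL}_{\mathrm{min}}}, & \mathrm{TOL}_{\mathrm{min}} \le \Delta_{\mathrm{rela}} \le \mathrm{TOL}_{\mathrm{max}} \\ 0, & \Delta_{\mathrm{rela}} > \mathrm{TOL}_{\mathrm{max}} \end{cases} \tag{3}$$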
where TOL[min] and TOL[max] are minimum and maximum tolerances,
respectively. The default values for TOL[min] and TOL[max] are 2% and
4%, respectively. The Δ[rela] is relative CCS error calculated as
Eq. [251]4.
$$\Delta_{\mathrm{rela}} = \frac{\mathrm{CCS}_{\mathrm{Pred}} - \mathrm{CCS}_{\mathrm{Exp}}}{\mathrm{CCS}_{\mathrm{Exp}}} \times 100 \tag{4}$$
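The trapezoidal score of Eqs. 3 and 4 can be sketched as below, using the default tolerances of 2% and 4%. The sketch assumes that the magnitude of the relative error is compared with the tolerances, so that over- and under-prediction are treated alike; this is an interpretation, and the function names are hypothetical.

```python
def relative_ccs_error(ccs_pred, ccs_exp):
    """Eq. 4 in percent; the absolute value is an assumption of this sketch."""
    return abs(ccs_pred - ccs_exp) / ccs_exp * 100.0

def ccs_match_score(ccs_pred, ccs_exp, tol_min=2.0, tol_max=4.0):
    """Eq. 3: 1 below tol_min, linear decay to 0 at tol_max, 0 beyond tol_max."""
    delta = relative_ccs_error(ccs_pred, ccs_exp)
    if delta < tol_min:
        return 1.0
    if delta <= tol_max:
        return 1.0 - (delta - tol_min) / (tol_max - tol_min)
    return 0.0
```

For an experimental CCS of 200 Å², a candidate predicted at 206 Å² (3% error) scores 0.5, while a candidate predicted at 210 Å² (5% error) scores 0 and is removed from the candidate list.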
The experimental MS/MS spectra and their possible candidates were
imported into in-silico MS/MS tools to perform MS/MS match. Three
in-silico MS/MS prediction tools, MetFrag^[252]44,
CFM-ID^[253]43, and MS-FINDER^[254]45, were used in this work. The
format of imported data was modified according to the requirements of
each tool. The brief procedures are described as follows: (1) MetFrag:
the command line version MetFragCL (version 2.4.5-CL) was downloaded
from [255]https://ipb-halle.github.io/MetFrag/, and the parameter file
was generated via R package ReSOLUTION
[[256]https://github.com/schymane/ReSOLUTION]; (2) CFM-ID: the software
version 2.4 was downloaded from
[257]https://sourceforge.net/projects/cfm-id/files/. The pre-trained
model params_se_cfm and the parameter file param_output0.log were used.
The predicted MS/MS spectra were provided as MSP format in
Supplementary Data [258]5. (3) MS-FINDER: the software version 3.24 was
downloaded from
[259]http://prime.psc.riken.jp/Metabolomics_Software/MS-FINDER/index.html,
and run from the console. The detailed parameters of each tool are
provided in Supplementary Table [260]14. The experimental MS/MS
spectral library was downloaded from GNPS with a total of 13,499
compounds ([261]https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp;
accessed on May 23rd, 2020). The spectral match utilized reverse
dot-product scores, and its parameters were kept the same as in our
previous publication^[262]32.
Multi-dimensional match was performed by integrating the CCS match
score and MS/MS match score as Eq. [263]5:
$$S_{\mathrm{integrated}} = W_{\mathrm{ccs}} \times S_{\mathrm{ccs}} + W_{\mathrm{MS/MS}} \times S_{\mathrm{MS/MS}} \tag{5}$$
where S[ccs] and S[MS/MS] are CCS and MS/MS match scores, respectively.
Here, S[MS/MS] is the similarity between experimental MS/MS and
in-silico MS/MS, which is obtained from different in-silico MS/MS tools
with different scoring methods. The S[MS/MS] is rescaled to 0–1 before
integration. The W[ccs] and W[MS/MS] are weights for the CCS and MS/MS
match scores, respectively. The W[ccs] and W[MS/MS] were optimized as
0.3 and 0.7, respectively (Supplementary Fig. [264]22).
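The integration in Eq. 5 can be sketched as follows with the optimized weights of 0.3 and 0.7. The rescaling of the raw in-silico MS/MS scores to 0–1 is assumed here to be a min-max rescaling across the candidates of one feature; the actual rescaling may differ between tools, and the function names and tuple layout are hypothetical.

```python
def rescale(scores):
    """Min-max rescale raw MS/MS scores to 0-1 (assumed rescaling method)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]  # all candidates tie; arbitrary choice
    return [(s - lo) / (hi - lo) for s in scores]

def integrated_score(s_ccs, s_msms, w_ccs=0.3, w_msms=0.7):
    """Eq. 5: weighted sum of the CCS and MS/MS match scores."""
    return w_ccs * s_ccs + w_msms * s_msms

def rank_candidates(candidates):
    """Rank (name, s_ccs, raw_msms_score) tuples by integrated score, best first."""
    scaled = rescale([raw for _, _, raw in candidates])
    scored = [(name, integrated_score(s_ccs, s))
              for (name, s_ccs, _), s in zip(candidates, scaled)]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because the MS/MS weight is larger, a candidate with the best MS/MS score but a CCS score of 0 still outranks a candidate with a perfect CCS match and the worst MS/MS score; the CCS score mainly reorders candidates whose MS/MS scores are close.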
Chemicals
LC–MS grade methanol (MeOH) and water (H[2]O) were purchased from
Honeywell (Muskegon, MI, USA). LC–MS grade acetonitrile (ACN) was
purchased from Merck (Darmstadt, Germany). LC–MS grade methylene
dichloride (CH[2]Cl[2]) was purchased from Fisher Scientific (Morris
Plains, NJ, USA). Ammonium hydroxide (NH[4]OH) and ammonium acetate
(NH[4]OAc) were purchased from Sigma (St. Louis, MO, USA). The chemical
standard succinoadenosine was purchased from J&K (Shanghai, China),
while other chemical standards were purchased from TopScience
(Shanghai, China).
Sample preparation
Aging mouse liver tissues (C57BL/6J; 36-week vs. 104-week; n = 10 for
each group) were dissected, frozen with liquid nitrogen, and stored at
−80 °C. The mouse tissue studies were approved by Animal Ethics and
Welfare Management Committee of Interdisciplinary Research Center on
Biology and Chemistry, Chinese Academy of Sciences (Shanghai, China).
Metabolite extraction followed our published protocol^[265]10. In
brief, 10 mg of mouse liver tissue was first homogenized with 200 μL
of H[2]O and 20 ceramic beads (diameter, 0.1 mm) using a homogenizer
(Precellys 24, Bertin Technologies) at low temperature.
The protein concentration of the homogenized solution was measured with
the Pierce BCA Protein Assay Kit (Catalog No. 23225, Thermo Fisher) for
normalization. One-hundred microliters of homogenized solution was used
for metabolite extraction. A total of 100 μL of H[2]O and 800 μL of
an ACN:MeOH solvent mixture (1:1, v/v) were added, and the sample was
vortexed for 30 s and sonicated for 10 min in a 4 °C water bath. After incubation for
1 h at −20 °C, the sample was further centrifuged for 15 min at
16,200 × g and 4 °C. The supernatant was collected and evaporated to
dryness at 4 °C. The dry extracts were then reconstituted into 100 μL
of ACN:H[2]O (1:1, v/v), followed by sonication at 4 °C for 10 min, and
centrifuged at 16,200 × g and 4 °C for 5 min to remove the insoluble
debris before LC–IM–MS/MS analysis.
Other biological samples were prepared as follows. For plasma, 100 μL
of human plasma (Catalog No. HPH-0500, Equitech-Bio. Inc, USA) was
extracted using 400 μL of a MeOH:ACN solvent mixture (1:1, v/v) in a
centrifuge tube, and then the mixture was vortexed for 30 s and
sonicated for 10 min in a 4 °C water bath. The rest of the procedure was
the same as described for the mouse liver tissue samples. For cell samples,
the RIPK1^-/- mouse embryonic fibroblast (MEF) cell line (generated from
RIPK1 KO mice) was provided by Prof. Junying Yuan’s Lab (Chinese
Academy of Sciences, Shanghai). One milliliter of MeOH:ACN:H[2]O
(2:2:1, v/v/v) solvent mixture was added to the samples, followed by
vortexing for 30 s and sonication for 10 min in a 4 °C water bath. Then the
samples were incubated in liquid nitrogen for 1 min, thawed on ice, and
sonicated for 10 min in a 4 °C water bath. The above vortex–freeze–thaw
cycle was repeated three times. The rest of the procedure was the same
as described for the mouse liver tissue samples. For fruit fly head samples,
the sample collection and extraction followed our previous
publication^[266]10.
LC–IM–MS/MS analysis
A UHPLC system (Agilent 1290 series) coupled to a quadrupole
time-of-flight mass spectrometer equipped with an ion mobility drift
tube (Agilent DTIM-QTOF-MS 6560, Agilent Technologies, USA) was used
for LC–IM–MS/MS data acquisition. The LC separation was performed on a
Waters BEH Amide column (particle size, 1.7 μm; 100 mm
(length) × 2.1 mm (i.d.)) maintained at 25 °C. Solvent A was 100%
H[2]O with 25 mM NH[4]OAc and 25 mM NH[4]OH, and solvent B was 100%
ACN. The flow rate was 0.3 mL/min, and the gradient was as
follows: 0–1 min: 95% B, 1–14 min: 95% B to 65% B, 14–16 min: 65% B to
40% B, 16–18 min: 40% B, 18–18.1 min: 40% B to 95% B and 18.1–23 min:
95% B. The sample injection volume was 2 μL.
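For readers who want to reproduce the gradient programmatically, the program above can be written as a table of breakpoints. The sketch below assumes that each "X% B to Y% B" step is a linear ramp, which is not stated explicitly; the names `GRADIENT` and `percent_b` are hypothetical.

```python
# (time in min, % solvent B) breakpoints taken from the gradient description.
GRADIENT = [(0.0, 95.0), (1.0, 95.0), (14.0, 65.0), (16.0, 40.0),
            (18.0, 40.0), (18.1, 95.0), (23.0, 95.0)]

def percent_b(t_min):
    """Return % B at time `t_min`, interpolating linearly between breakpoints."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-23 min run")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
```

Under the linear-ramp assumption, the mobile phase is 80% B at 7.5 min and 52.5% B at 15 min, and the column is re-equilibrated at 95% B from 18.1 min to the end of the 23 min run.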
The data acquisition was operated in IM-Q-TOF mode. The source
parameters were set as follows: sheath gas temperature, 325 or 275 °C
in positive or negative modes, respectively; dry gas temperature, 300 °C;
sheath gas flow, 11 L/min; dry gas flow, 8 L/min; capillary voltage, 4000 V or
−3000 V in positive or negative modes, respectively; and nebulizer
pressure, 20 or 25 psi in positive or negative modes, respectively. The
TOF mass range was set as m/z 50–1700. For ion mobility parameters,
nitrogen (N[2]) was used as the drift gas. Other related IM
parameters were set as follows: entrance and exit voltages of the drift
tube, 1600 and 250 V; trap filling and trap release times, 20,000 and
150 μs. The pressure of the drift tube was set at 3.95 Torr. The MS/MS
spectra were acquired in the “Alternating frames” mode, and the
collision energy was fixed at 20 V in frame 2. The CCS values were
calculated with the single-field method. All data acquisitions
were carried out using MassHunter Workstation Data Acquisition Software
(Version B.08.00, Agilent Technologies, USA).
Chemical standards were first dissolved at 0.01 mg/mL in either H[2]O,
MeOH, CH[2]Cl[2], DMSO, or their mixtures in different proportions
depending on compound polarity and solubility, and subjected to
measurements of CCS values and MS/MS spectra. The CCS values were
independently measured three times across 2 months using a single-field
approach on the Agilent DTIM-QTOF-MS 6560 instrument according to our
previous publication^[267]27. The MS/MS spectra were acquired using
a targeted MS/MS method with three different collision energy levels (10,
20, and 40 V).
Data processing and metabolite annotation
Raw MS data files (.d) were first recalibrated using IM–MS Reprocessor
(Version B.08.00, Agilent Technologies). Then, the smoothing and
saturation repair were performed using PNNL PreProcessor (Version
2018.06.02). The CCS calibration was performed by IM–MS Browser
software (Version B.08.01, Agilent Technologies). The pre-processed
data files were submitted for feature finding, alignment, and MS/MS
spectra extraction using Mass Profiler (Version 10.0, Agilent
Technologies). Finally, the peak table and MS/MS spectra (CEF format)
files were exported for metabolite annotation. The MS/MS spectrum with
the highest intensity was selected for each feature, similar to the
protocol in LipidIMMS Analyzer^[268]32. The detailed parameters of the data
processing tools are provided in Supplementary Table [269]15. The
metabolites were annotated using multi-dimensional match as
described above. The m/z tolerance was set at 25 ppm, and only [M+H]^+
and [M-H]^- adducts were considered for positive and negative modes,
respectively. MS-FINDER was used for in-silico MS/MS match, and
chemical structures within the top 3 formulas were kept for unknown metabolite
annotation. The known metabolite database (KEGG and HMDB) and the
extended database were used for known and unknown metabolite
annotation, respectively.
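The m/z match described above can be illustrated with a minimal sketch for the two adducts considered, [M+H]^+ and [M-H]^-, and the 25 ppm tolerance. The function names are hypothetical, and the proton mass is rounded to six decimals.

```python
PROTON_MASS = 1.007276  # Da, rounded

def adduct_mz(monoisotopic_mass, mode):
    """Theoretical m/z of [M+H]+ (positive mode) or [M-H]- (negative mode)."""
    if mode == "positive":
        return monoisotopic_mass + PROTON_MASS
    if mode == "negative":
        return monoisotopic_mass - PROTON_MASS
    raise ValueError("mode must be 'positive' or 'negative'")

def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def within_tolerance(observed_mz, theoretical_mz, tol_ppm=25.0):
    """True when the absolute ppm error does not exceed the tolerance."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm
```

For glucose (monoisotopic mass 180.06339 Da), the [M+H]^+ ion is expected at m/z 181.07067; a feature observed at m/z 181.0720 lies about 7 ppm away and is retained as a candidate, whereas one at m/z 181.0800 (about 52 ppm) is not.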
Generation of unknown metabolites
Unknown metabolites were generated by in-silico enzymatic
reactions using BioTransformer^[270]46 (version 1.0.8). The command line
tool was downloaded from
[[271]https://bitbucket.org/djoumbou/biotransformer/src/master/]. The
SMILES structures of KEGG compounds were used for the in-silico reactions,
and the “EC-based transformation” was used for metabolic
transformation. The number of reaction steps was set to 2. All generated
metabolites were merged by InChIKey, and their SMILES structures were
converted via Open Babel^[272]58. In total, 100,404 unknowns
were generated and included in the extended database
(Supplementary Data [273]6). These compounds and their predicted CCS
values are also provided on the AllCCS webserver.
Metabolic pathway and structure enrichment analysis
For the analysis of aging mouse samples, the peak intensity table from
Mass Profiler was first normalized to the protein concentration from
the BCA assay. Then, zero values were imputed using the KNN algorithm.
Student’s t-test was used to calculate p-values. The metabolic pathway and
chemical structure enrichment analyses were performed via the
hypergeometric test^[274]59 and Kolmogorov–Smirnov (KS) test^[275]60,
respectively. All chemical classes of unknowns were obtained using
ClassyFire. The quantitative analysis followed our previous
publication^[276]10, and the z-scale normalization of peak intensities
was used in this work.
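The hypergeometric test used for pathway enrichment can be written in a few lines. The sketch below assumes a one-sided over-representation test, i.e., the probability of observing at least k pathway members among the n dysregulated metabolites; the function and argument names are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, pathway_size, n_selected, n_background):
    """P(X >= k) for a hypergeometric distribution.

    n_background: annotated metabolites in total
    pathway_size: background metabolites that belong to the pathway
    n_selected:   dysregulated metabolites
    k:            dysregulated metabolites that belong to the pathway
    """
    upper = min(pathway_size, n_selected)
    total = comb(n_background, n_selected)
    hits = sum(comb(pathway_size, i) * comb(n_background - pathway_size, n_selected - i)
               for i in range(k, upper + 1))
    return hits / total
```

As a small check, with 10 background metabolites, 5 of them in a pathway, and 5 dysregulated metabolites, observing all 5 in the pathway gives p = 1/252 ≈ 0.004.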
Reporting summary
Further information on research design is available in the [277]Nature
Research Reporting Summary linked to this article.
Acknowledgements