Abstract

   The wide applications of liquid chromatography - mass spectrometry
   (LC-MS) in untargeted metabolomics demand an easy-to-use, comprehensive
   computational workflow to support efficient and reproducible data
   analysis. However, current tools were primarily developed to perform
   specific tasks in LC-MS based metabolomics data analysis. Here we
   introduce MetaboAnalystR 4.0 as a streamlined pipeline covering raw
   spectra processing, compound identification, statistical analysis, and
   functional interpretation. The key features of MetaboAnalystR 4.0
   includes an auto-optimized feature detection and quantification
   algorithm for LC-MS1 spectra processing, efficient MS2 spectra
   deconvolution and compound identification for data-dependent or
   data-independent acquisition, and more accurate functional
   interpretation through integrated spectral annotation. Comprehensive
   validation studies using LC-MS1 and MS2 spectra obtained from standards
   mixtures, dilution series and clinical metabolomics samples have shown
   its excellent performance across a wide range of common tasks such as
   peak picking, spectral deconvolution, and compound identification with
   good computing efficiency. Together with its existing statistical
   analysis utilities, MetaboAnalystR 4.0 represents a significant step
   toward a unified, end-to-end workflow for LC-MS based global
   metabolomics in the open-source R environment.

   Subject terms: Software, Metabolomics
     __________________________________________________________________

   Several bottlenecks exist in metabolomics data analysis. Here, the
   authors present MetaboAnalystR 4.0 as a unified workflow for LC-MS
   untargeted metabolomics. It highlights significant improvements in
   LC-MS2 spectral processing and functional analysis, providing an
   end-to-end computational pipeline.

Introduction

   Metabolomics involves the comprehensive identification and
   quantification of small compounds in biological samples using various
   analytical techniques^[36]1. Liquid chromatography - mass spectrometry
   (LC-MS) has been the primary analytical platform for global or
   untargeted metabolomics and exposomics^[37]2,[38]3. Following spectra
   acquisition, common bioinformatics tasks include spectra processing,
   compound identification, statistical analysis, and functional
   interpretation of key patterns and significant signatures^[39]4. Over
   the past decades, powerful computational tools have been developed for
   individual tasks such as XCMS, MS-DIAL, MZmine, and asari for raw
   spectral processing^[40]5–[41]9; GNPS, SIRIUS, and MS-FINDER for
   compound identification^[42]10–[43]12; MetaboAnalyst and various R
   packages for statistical analysis^[44]13,[45]14; KEGG and mummichog for
   functional interpretation^[46]15–[47]17. Coordinating these different
   tools to enable streamlined data analysis and comprehensive
   understanding represents a major barrier to current LC-MS-based
   metabolomics and exposomics studies.

   To enable a unified metabolomics workflow, several critical issues need
   to be addressed in spectra processing and interpretation. To facilitate
   both quantitative analysis and compound identification, LC-MS
   untargeted metabolomics are typically conducted with MS1 full scans
   coupled with tandem MS (MS2) using data-dependent acquisition (DDA) or
   data-independent acquisition (DIA) methods^[48]18,[49]19. Although DDA
   usually acquires MS2 spectra by fragmenting precursor ions within a
   relatively narrow window, recent studies show that over half of them
   are chimeric and need to be deconvolved^[50]18. In contrast, DIA
   fragments all ions in a wider m/z range with multiple cycles to improve
   the metabolome coverage, and spectra deconvolution is essential to
   relink precursors with fragment ions. Sequential window acquisition of
   all theoretical fragment ion mass spectra (SWATH-MS) is a specific
   variant of DIA to help overcome DDA drawbacks by leveraging all
   ions^[51]20. Despite multiple tools and
   algorithms^[52]12,[53]18,[54]21–[55]24, integrated LC-MS1 and MS2
   spectra processing remains a primary bottleneck in terms of
   computational efficiency as well as the requirement of deep knowledge
   for parameter settings associated with using these tools. Most of them
   can only process either DDA or DIA data. There is a need for a highly
   efficient tool capable of supporting both methods. Secondly, MS2-based
   compound identification relies heavily on the availability of reference
   spectra databases. The lack of a comprehensive yet easily customizable
   reference spectra database remains an essential gap in current global
   metabolomics. Thirdly, compound identifications typically involve
   matching m/z values and retention times of MS1 features, as well as
   their associated MS2 patterns against reference databases. This process
   often yields multiple candidates and requires further time-consuming
   manual curation before functional interpretation can be performed.
   Traditional functional enrichment analysis relies on accurate compound
   identification. Recent studies have shown that, despite uncertainties
   at individual compound levels, functional activities can be reliably
   predicted by leveraging the patterns of the putative identifications
   based on m/z values and retention times of MS1 features^[56]15,[57]16.
   The performance could be further improved by utilizing MS2 spectral
   data. To meet these emerging demands from the growing applications of
   metabolomics across broad areas, we have developed MetaboAnalystR
   version 4.0 to integrate LC-MS1 and MS2 spectra processing, compound
   identification, statistical analysis, as well as functional
   interpretation within a unified software environment.

   MetaboAnalystR 1.0 was released in 2018 as the underlying R package of
   the popular MetaboAnalyst web server to facilitate transparent,
   reproducible statistical data analysis. The subsequent releases
   (version 2.0 in 2019 and version 3.0 in 2020) added important
   capacities for LC-MS1 spectra processing and functional interpretation,
   respectively. The R package allows easy creation of flexible data
   processing pipelines that complement the pre-defined workflows by the
   MetaboAnalyst web server. It is well received by the
   metabolomics community as evidenced by >1000 citations^[58]25–[59]27.

   Here, we introduce MetaboAnalystR 4.0 developed as a unified
   computational workflow for LC-MS-based global metabolomics. In addition
   to enhancing and consolidating the LC-MS1 spectral processing and
   associated statistics and functional analysis modules, MetaboAnalystR
   4.0 contains several key features including (1) an auto-optimized DDA
   data deconvolution workflow to deal with chimeric spectra; (2) an
   efficient SWATH-DIA data deconvolution pipeline; (3) a comprehensive
   collection of reference spectra databases (>1.5 million spectra)
   coupled with common search algorithms supporting custom plug-in
   databases, and (4) more accurate functional activity prediction by
   integrating LC-MS1 and MS2 results. To better understand the
   performance characteristics of both individual components as well as
   the complete workflow, we performed comprehensive validation studies
   using seven different datasets ranging from standards mixtures,
   dilution series, exposomics data to clinical samples.

Results

General workflow

   MetaboAnalystR 4.0 features end-to-end support for metabolomics data
   analysis, spanning from raw LC-MS spectra processing to statistical and
   functional analysis. In addition to generic data tables, compound or
   LC-MS peak lists, MetaboAnalystR accepts raw spectra in open formats
   (mzML, mzXML, mzData, netCDF and MGF) from LC-MS metabolomics
   experiments, as well as output files from other well-established
   spectra processing tools, including MS-DIAL, MZmine, XCMS, asari,
   MS-FINDER, and SIRIUS. The pre-processed spectral peaks and identified
   compounds will be automatically formatted and enter a unified pipeline
   for downstream processing, statistics, and functional analysis as shown
   in Fig. [60]1.

Fig. 1. LC-MS untargeted metabolomics workflow based on MetaboAnalystR 4.0.

   [61]Fig. 1
   [62]Open in a new tab

   Mass spectra are processed separately for MS1 and MS2 levels based on
   its auto-optimized workflows. MetaboAnalystR 4.0 also accepts
   pre-processed results from other spectra processing tools. The
   resulting metabolite feature table will be filtered and/or normalized
   for statistical and functional analysis.

LC-MS spectra processing

   MetaboAnalystR features an auto-optimized LC-MS1 spectra processing
   pipeline since version 3.0. The workflow involves extracting regions of
   interest followed by parameter optimization based on the design of
   experiments^[63]27. The optimized parameters are subsequently used for
   peak detection, quantification, alignment, and annotation. This
   approach achieves good performance with high computing efficiency. In
   version 4.0, the LC-MS spectra pre-processing workflow has been
   significantly expanded to support integrated analysis of MS1 and MS2
   spectra data (DDA or SWATH-DIA).

   For DDA data, MetaboAnalystR first assigns all MS2 spectra from a
   single sample into different feature groups based on the mass-to-charge
   ratio (m/z) and retention time of their precursor ions. The chimeric
   status is evaluated based on the nearest MS scans. Briefly, MS2 spectra
   of all ions (including the main precursor and other contaminating ions
   within the isolation window and above the intensity threshold) are
   extracted from MS2 spectral reference libraries as candidate spectra
   (“Methods” section, Fig. [64]2a). If any reference spectrum is missing,
   a predicted spectrum will be generated using a similarity-network
   model^[65]28 (“Methods” section, Fig. [66]2b). All candidates are then
   used to obtain the deconvolved spectrum based on a self-tuned
   regression algorithm (“Methods” section). The SWATH-DIA data processing
   is based on the DecoMetDIA approach^[67]23, with the core algorithm
   re-implemented using the Rcpp/C++ framework and further optimized to
   support parallel computing to address the computational bottleneck
   associated with the original implementation.

Fig. 2. DDA spectra deconvolution in MetaboAnalystR 4.0.

   [68]Fig. 2
   [69]Open in a new tab

   a The overall workflow. All spectra are evaluated as “clean” or
   “chimeric” based on the MS signals of the nearest MS1 scan. The
   reference spectra of all ions (A, B, and C) within the isolation window
   and above the intensity threshold are extracted from a reference MS2
   library for regression analysis. b Diagram of reference spectrum
   prediction strategy. MetaboAnalystR can predict one or more candidates
   missing in the MS2 library. Formula(s) of the ion (e.g., ion C) is
   predicted at first. An abiotic/bio-transformation network is
   constructed for the formula (e.g., C[2]H[7]NO[3]S), and all neighbors
   with reference spectra of the formula are extracted as a list. Each
   fragment of the single spectrum in the list is predicted in a formula.
   If the formula (e.g., C[3]H[9]) of the fragment includes more chemical
   elements (number or type) than the original formula (e.g.,
   C[2]H[7]NO[3]S), it is excluded from the spectrum. The clean spectrum
   is returned to the list. Their similarities to the original MS2
   spectrum of all spectra in the list are evaluated, and the one with the
   highest similarity score is selected as the predicted spectrum for the
   ion.

   Following the deconvolution step, all MS2 spectra from replicates (if
   any) for a specific MS feature undergo a consensus step to generate a
   single spectrum to minimize potential errors and noise. This process
   may optionally utilize the observed frequency of a specific fragment.
   For instance, if a fragment is detected only once in replicates and is
   not seen in any reference MS2 spectra, it may be optionally excluded
   (“Methods” section, Supplementary Fig. [70]1).

   The consensus spectra are searched against a reference spectra database
   for compound identification (“Methods” section). Users can choose
   different databases based on instrument types, collision energies, and
   other common options. Two widely used methods – dot product and
   spectral entropy have been implemented to evaluate MS2 matching
   similarity^[71]29. The candidates for a particular feature are scored
   by considering m/z, retention time, isotope, and MS2 similarity score
   together^[72]5. The matching scores range between 0 and 100, where 0
   indicates no matching and 100 indicates a perfect match (“Methods”
   section). The top N candidates, defined by users, can be exported as
   the database search results. If the matching score is below 10,
   MetaboAnalystR optionally performs a neutral loss scan to further
   improve the compound identification rate^[73]30.

   To be interoperable with other widely used tools, MetaboAnalystR
   accepts LC-MS1 feature detection results from MS-DIAL, MZmine, asari,
   and XCMS^[74]5–[75]9. The results can be automatically formatted into
   the generic format for processing by MetaboAnalystR. Additionally,
   MetaboAnalystR accepts the pre-processed LC-MS2 results in MSP or
   MGF files generated by MS-DIAL and MZmine. Users could input these
   files directly for MS2 database search and export the results for
   downstream analysis. MetaboAnalystR also offers functionalities to
   convert the compound identification results from MS-FINDER and SIRIUS
   for functional analysis^[76]10–[77]12.

Comprehensive MS2 spectra database

   MetaboAnalystR 4.0 comes with a comprehensive reference spectra
   database organized under five different themes or libraries, including
   pathway compound library, biology compound library, lipid library,
   exposome library, and the complete library. A summary of these five
   libraries is presented in Table [78]1. These databases have been
   curated from public repositories, including HMDB^[79]31, MoNA
   Series^[80]32, LipidBlast^[81]33, MassBank^[82]32, GNPS^[83]22,
   LipidBank^[84]34, MINEs^[85]35, LipidMAPs^[86]36, and KEGG^[87]17
   (“Methods” section). All fragments in the MS2 spectra reference
   libraries have been further annotated into formulas using BUDDY^[88]37.
   In addition, the corresponding neutral loss spectra databases have also
   been pre-calculated, corresponding to the five databases above. The
   underlying spectra, which include original MS2 spectra, neutral loss,
   and annotated fragments, are stored as SQLite files available for
   download. The simple schema of the SQLite database allows users to
   easily prepare their own in-house custom databases for compound
   searching.

Table 1.

   Summary of MS2 reference spectra libraries curated from different
   public resources
         Name        Records   Unique compounds Size (MS2 | neutral loss)
   Complete library 10,420,215 1,551,012        7.2 | 6.4 GB
   Pathway library  172,370    3456             138.2 | 94.1 MB
   Biology library  864,386    49,055           744.0 | 491.0 MB
   Exposome library 1,883,828  106,387          1.5 | 1.1 GB
   Lipid library    3,221,409  878,220          1.9 | 1.1 GB
   [89]Open in a new tab

Integrating LC-MS data processing results for accurate functional insights

   Obtaining functional insights is among the main interests in
   metabolomics studies. Conventional approaches generally require manual
   annotation of a significant portion of spectral features, which is a
   very time-consuming process. The mummichog algorithm^[90]15 has
   demonstrated that putatively annotated compounds from high-resolution
   MS peaks are able to collectively point to the correct pathways. A
   pivotal step in this process is provisionally mapping m/z information
   to empirical compounds. The initial algorithm (version 1.0) uses m/z
   values of MS peaks. The subsequent release (version 2.0) incorporates
   peak retention times. MetaboAnalystR 4.0 further utilizes MS2
   information to filter out impractical assignments to improve accuracy
   and specificity. After MS1 and MS2 spectral processing, MetaboAnalystR
   can automatically perform statistical analysis from the peak intensity
   table and format the database searching results for functional
   enrichment analysis (“Methods” section). The default database for
   functional analysis is based on the known pathways curated from KEGG
   and BioCyc^[91]38. Users can also choose other functional libraries
   based on disease signatures, chemical ontologies, SNP-associated
   metabolites, etc^[92]39.

Benchmarking and case studies

   Using a total of seven datasets, including one whole blood exposomics
   dataset, one serial dilution dataset, three standards mixtures
   datasets^[93]18,[94]40,[95]41, and two clinical plasma metabolomics
   datasets^[96]42,[97]43, we evaluated the performance of MetaboAnalystR
   4.0 for LC-MS1 spectra processing and quantification, MS2 spectra
   processing and compound identification, statistical analysis, and
   functional interpretation. When possible, we also compared with those
   obtained using other commonly used tools, and the results were provided
   as [98]supplementary materials.

Performance of LC-MS1 spectral processing and quantification

   Understanding the chemical compositions of different types of blood
   samples is important, especially for diseases whose pathogenesis is
   closely related to blood cellular components^[99]44,[100]45. To
   evaluate the performance of LC-MS1 spectral processing workflow, we
   have generated a whole blood exposomics dataset to compare the
   differences in plasma, serum, and whole blood from 14 individuals using
   LC-MS-based global metabolomics. To ensure comprehensive metabolome
   coverage, LC-MS1 spectra were collected using two column types under
   positive and negative ion modes (C18-ESI^+, C18-ESI^−, HILIC-ESI^+, and
   HILIC-ESI^−) for each sample. MS2 has been acquired in both DDA and
   SWATH-DIA methods. The general design of this study is shown in
   Fig. [101]3a.

Fig. 3. Evaluation of LC-MS1 data processing and quantification.

   [102]Fig. 3
   [103]Open in a new tab

   a Design of the metabolomics experiment. b Statistics of LC-MS1
   features across different modes. Generic features, MS features detected
   across all blood samples; Unique features, features detected only in
   one type of blood samples. c PCA score plot of the blood dataset (C18,
   ESI^+). d Heatmap of the complete metabolic profiles (MS1 level, C18,
   ESI^+). Unique MS1 features for a specific blood sample type were
   highlighted with rectangles. Blue, unique features for serum compared
   to plasma; Green, unique features for plasma compared to serum. Ruby,
   unique features for whole blood compared to plasma and serum. e Design
   of serial dilutions. Urine and serum are mixed according to the ratio
   labeled at the x-axis. f Correlation analysis of MS features from
   serial dilutions. MetaboAnalystR reported high average correlation
   coefficients (>0.85) across different modes, compared to other tools
   (Supplementary Fig. [104]7).

   All MS features were detected with the auto-optimized LC-MS1 workflow
   in MetaboAnalystR. Over 12,000 MS1 features have been detected for each
   individual mode (Fig. [105]3b). Statistical analysis was performed on
   the feature abundance tables. Principal component analysis (PCA) shows
   significant differences in the metabolomes of the three different types
   of blood samples (Fig. [106]3c). A clustering heatmap was used to view
   abundance distributions of all MS1 features (Fig. [107]3d). Generally,
   whole blood contains more metabolite features compared to serum and
   plasma. Serum does not metabolically equate to plasma as shown by the
   presence of unique features in both data types. In this study, we
   focused on elucidating the distinct features among different types of
   blood samples. Over 2000 unique metabolic features have been found in
   this dataset (Fig. [108]3b). Similar results have also been observed
   from the other three datasets (Supplementary Figs. [109]2–[110]6).

   To further evaluate the quantification performance of MetaboAnalystR, a
   dilution series (n = 13) was prepared by mixing serum and urine samples
   with varying ratios in a cross-gradient manner^[111]7 (Fig. [112]3e,
   “Methods” section). The cross-gradient dilution is designed to minimize
   the potential bias caused by matrix effects in equal dilutions^[113]46.
   We conducted a Pearson correlation analysis between the dilution ratios
   and the MS1 feature intensities obtained using MetaboAnalystR. As
   illustrated in Fig. [114]3f, over 90% of MS1 features detected by
   MetaboAnalystR exhibited high average correlation coefficients (>0.85),
   indicating its excellent quantification performance. Compared to other
   tools under their default settings, MetaboAnalystR exhibits the overall
   best quantification results (Supplementary Figs. [115]7–[116]9).

Performance of LC-MS2 spectral processing

   Three standards mixtures (SM) datasets with different levels
   of complexities were downloaded from Mass Spectrometry Metabolite
   Library of Standards (MSMLS, IROA Technologies). The 1st SM dataset
   contains a mixture of 526 standards (complex mixture), including one
   DDA injection and one SWATH-DIA injection (single replicate for both
   ESI^+ and ESI^−). The 2nd SM includes 15 DDA samples, each containing
   10–15 non-isobaric compounds (simple mixture, single replicate for both
   ESI^+ and ESI^−). The 3rd SM dataset contains 91 compounds (medium
   complexity) and was analyzed by both DDA and SWATH-DIA (three
   replicates each for both ESI^+ and ESI^−)^[117]40. The chemical
   identity information for each compound (retention time, m/z, and
   InChIKey) in each SM dataset was utilized as the corresponding
   reference compound lists to assess the accuracy of compound
   identification.

   The 1st SM dataset was used to evaluate the performance of DDA
   deconvolution. The results indicate that performing deconvolution on
   DDA spectra has increased the number of compounds that were correctly
   identified as the top candidates (Table [118]2, Supplementary
   Tables [119]1–[120]3, Supplementary Fig. [121]10), and the matching
   scores of the identified compounds reported by the deconvolution
   pipeline were significantly improved compared to those obtained from
   the pipeline without deconvolution (Fig. [122]4a). Compared to other
   tools under their default settings, MetaboAnalystR detected the highest
   number of compounds with high computational efficiency (Table [123]2).
   Similar results were also observed from the 2nd SM dataset, with
   MetaboAnalystR showing a higher compound identification rate than other
   tools under their default settings (Fig. [124]4b, Supplementary
   Fig. [125]11). While manual parameter optimization by expert users may
   improve the performance of other tools, the MS2 spectra processing in
   MetaboAnalystR is auto-optimized (“Methods” section) to avoid such
   requirements.

Table 2.

   Summary of identified compounds from a standards mixture of 406
   compounds by different tools
             Tools           ^aDetected (MS1) ^bAnnotated (MS2) Percentage (%)
   MetaboAnalystR (deco)     239              159               39.2
   MetaboAnalystR (non-deco) 239              146               36.0
   MS-DIAL/MS-FINDER         165              82                20.2
   MZmine/SIRIUS             221              77                19.0
   [126]Open in a new tab

   The result is based on MS2 (DDA, ESI^+). The results from the other
   three modes are available in Supplementary Tables [127]1–[128]3.

   ^aDetected: number of MS1 peaks matched the m/z and retention time of
   the standards.

   ^bAnnotated: number of compounds correctly identified as the top hit
   from the MS2 spectra search.

Fig. 4. Evaluation of LC-MS2 data processing and result visualization.

   [129]Fig. 4
   [130]Open in a new tab

   a Comparison of matching scores of DDA (with or without deconvolution,
   ESI^+) in the complex standards mixture sample. The deconvolution
   algorithm could significantly improve the matching scores of chemical
   candidates in comparison to the non-deconvolved spectra. n = 99
   independent compounds, means ± SEM; paired one-tailed Student’s t-test
   (**, significant, p = 1.5e-7, no adjustment). b Statistical analysis of
   the compound identification results of the 2nd standards mixture
   dataset. Compared to the other two workflows, MetaboAnalystR reported a
   significantly higher compound identification percentage. n = 15
   independent datasets, means ± SEM; unpaired one-tailed Student’s t-test
   without adjustment (MetaboAnalystR vs. MS-DIAL/MS-Finder, significant,
   p = 0.008; MetaboAnalystR vs. MZmine/SIRIUS, significant, p = 0.0003;
   MS-DIAL/MS-Finder vs. MZmine/SIRIUS, insignificant, p = 0.14). c Number
   of identified compounds from the 3rd standards mixture dataset using
   different MS2 reference libraries. d Results of MS2 spectra-based
   compound identification for the blood exposomics dataset. e Example
   mirror plot of 2-Piperidinone illustrating the MS2 spectrum matching
   pattern. The upper side (blue) represents the user’s input, while the
   bottom side (red) displays the spectrum from the reference library.
   When the mouse hovers over the fragments, corresponding information,
   including m/z, relative intensity, and potential formula, is displayed.
   All matched fragments are marked with a red diamond at the top.

   Using the 3rd SM dataset, we first tested the number of compounds that
   could be identified correctly. For DDA dataset, MetaboAnalystR
   correctly identified most compounds by using either Complete library or
   Pathway library (Fig. [131]4c). To evaluate the false discovery rate,
   we generated a series of decoy spectra by randomly increasing m/z error
   and replacing the original MS2 spectra with synthetic spectra from
   isobaric compounds (“Methods” section, Supplementary Fig. [132]12). The
   results showed that MetaboAnalystR was able to maintain a low false
   positive rate (<25%, Supplementary Figs. [133]13 and [134]14).

   Finally, we evaluated the compound identification performance on
   complex biological samples using the blood exposomics dataset. Herein,
   we focused on the unique features of different types of blood samples.
   These features were used as target lists for MS2-based compound
   identification. As depicted in Fig. [135]4d, MetaboAnalystR identified
   over 500 compounds (level 2a) through DDA, which was ~5% more compared
   to compound identification based on non-deconvoluted DDA spectra
   (Supplementary Fig. [136]15). For instance, 2-Piperidinone
   (Fig. [137]4e) was only identified after deconvolution, with a matching
   score of 92. Over 1000 compounds were identified based on SWATH-DIA.
   The chemical composition analysis showed that whole blood samples
   contained more lipids, organic acids, and organic heterocyclic
   components in contrast to serum and plasma. The main chemical
   differences between serum and plasma are lipids (Supplementary
   Fig. [138]16).

Evaluating the complete workflow using clinical metabolomics data

   We used two clinical metabolomics datasets collected from two recent
   COVID-19 studies^[139]42,[140]43 to validate the complete workflow of
   MetaboAnalystR including raw LC-MS1 and MS2 spectra processing,
   compound identification, statistical analysis, and biological
   interpretation. The 1st dataset includes 160 samples for both polar
   metabolites and non-polar lipids datasets (ESI^+ and ESI^−, DDA,
   “Methods” section) categorized into five groups (Asymptomatic, Mild,
   Severe, Critical, and Fatal)^[141]42. The 2nd dataset includes 30
   samples (ESI^+ and ESI^−, SWATH-DIA, “Methods” section) categorized
   into disease and healthy control^[142]43.

   As shown in Fig. [143]5a, MetaboAnalystR detected over 5000 MS1
   features for all datasets. Compound identification result indicates
   that MetaboAnalystR could identify around 10%~24% of compounds (level
   2a, Fig. [144]5b), which is higher than other tools under their default
   settings (Supplementary Figs. [145]17–[146]19). Note the comparison
   with other workflows or tools could only be performed till this step.

Fig. 5. Biological interpretation of clinical metabolomics data.

   [147]Fig. 5
   [148]Open in a new tab

   a Summary of MS1 features detected from DDA and SWATH-DIA datasets. b
   Percentage of MS1 features identified based on MS2 spectra. c
   Comparison of significant pathways obtained by integrating MS2
   identification results (MS1 + MS2) or not (MS1). d Scatter plot of
   pathway enrichment analysis results of SWATH-DIA dataset. n = 30
   independent experiment samples. Fisher’s exact test without adjustment
   for functional analysis. e Venn diagram of pathway analysis results
   from SWATH-DIA datasets.

   We applied the enhanced mummichog algorithm using the results obtained
   above for function analysis. We chose to compare biological differences
   between the Mild and Fatal groups for the 1st dataset, and between the
   disease and healthy controls groups for the 2nd dataset. There are two
   sub-datasets (polar and non-polar lipids) in the 1st dataset and one
   dataset in the 2nd dataset. Therefore, a total of three comparisons
   were performed. In general, more pathways were reported when MS2 data
   were used (Fig. [149]5c, d). Compared to the result based only on MS1
   features, four more significant pathways were reported when considering
   the SWATH-DIA data (Fig. [150]5e). These pathways are related to
   phosphatidylinositol phosphate, vitamin D, vitamin C metabolism, and
   arachidonic cascades (Supplementary Tables [151]4–[152]6). They were
   reported to be related to the pathogenesis of the disease by the
   previous studies^[153]47–[154]50. All predicted active pathways
   (including those based on other tools) are summarized in Supplementary
   Figs. [155]20–[156]24.

Assessment of computational performance

   The two clinical metabolomics datasets were used to benchmark the
   computational performance of all tools included in this study. The
   assessments were conducted using a workstation (Dell OptiPlex 7070,
   64GB RAM, Intel-i7-9700 CPU, Ubuntu 20.04.2) controlled by Simple Linux
   Utility for Resource Management (SLURM^[157]51, “Methods” section).
   MetaboAnalystR completed raw LC-MS1 spectral processing within 5 h and
   MS2 spectral processing within 16 h for each of the three datasets,
   with memory usage below 25GB. Compared to other tools under their
   default settings, MetaboAnalystR achieved MS feature detection at a
   speed similar to MZmine and MS-DIAL (Supplementary Table [158]7 and
   Supplementary Figs. [159]25–[160]26). The memory consumption of
   MetaboAnalystR is comparable to that of MS-DIAL but lower than that of
   other tools. According to the execution logs, MS2 spectra searching in
   SIRIUS is based on application programming interface (API) calls to
   remote web services, which is highly dependent on the network traffic
   and responsiveness of the remote server. MS-FINDER also partially uses
   remote access to predict formulas. In comparison, MetaboAnalystR allows
   users to run the complete workflow locally without depending on network
   traffic or remote server load.

Discussion

   Despite significant progress made in the past decades, analyzing LC-MS
   data from untargeted metabolomics remains challenging. Researchers have
   to learn to use multiple tools and R packages in order to achieve
   comprehensive data analysis and understanding. In addition, better
   support for MS2 spectra processing is required to enable more accurate,
   high-throughput compound identifications and functional understanding.
   We have developed MetaboAnalystR 4.0 to help orchestrate LC-MS spectra
   processing, compound identification, statistical analysis, and
   biological interpretation within an open-source R environment.
   MetaboAnalystR accepts LC-MS1 and MS2 data in common open formats.
   Compound identification can be performed efficiently within the same
   workflow using the comprehensive reference spectra libraries. The
   results can be directly fed into the downstream statistical and
   functional analysis. Our comprehensive case studies have indicated that
   MetaboAnalystR 4.0 performs favorably when compared with other widely
   used tools in terms of spectral processing, compound identification,
   quantification, interpretation as well as computing efficiency.

   A key innovation and advantage of using MetaboAnalystR 4.0 for raw
   spectral processing lies in its auto-optimized design for common LC-MS1
   and MS2 spectra processing. Another important contribution is the
   manual curation and annotation of comprehensive MS2 spectra libraries
   coupled with efficient search algorithms for compound identification.
   Users can visually explore the MS2 spectra searching results through
   interactive mirror plots. As an open-source R package, MetaboAnalystR
   is designed to be extensible and interoperable with other tools and
   workflows. It not only accepts common data formats (such as csv and
   txt) but also supports standard formats (such as mzTab). MetaboAnalystR
   can also directly incorporate raw spectral processing results (such
   as MSP and MGF files) generated from other well-established tools into
   its workflow for downstream analysis. Finally, MetaboAnalystR 4.0 is
   the backend of the MetaboAnalyst 6.0 web application^[161]52, which has
   been developed to support users who are not comfortable with the R
   command-line interface.

   An important limitation of the current implementation of MetaboAnalystR
   4.0 is that the built-in database of MS2 reference spectra were
   compiled from various instruments and experimental conditions,
   resulting in potentially varying qualities in compound identification.
   In contrast, MS2 reference spectra from METLIN^[162]53 or other
   commercial tools originate from in-house experimentally validated
   databases under more standardized conditions. We encourage users to
   convert their own in-house databases into the compatible SQLite format
   for the MS2 search workflow.

   We will continue to improve the MetaboAnalystR package and the
   MetaboAnalyst web application to make LC-MS global metabolomics more
   accessible to the community. For future directions, we intend to add a
   new component to support the analysis of data from stable
   isotope-labeling experiments^[163]54,[164]55. MetaboAnalystR does not
   support ion-mobility spectrometry and all-ion fragmentation (AIF)
   data^[165]56. They have also been increasingly applied in metabolomics
   studies. The support for these features will be achieved in the future
   release.

   In conclusion, MetaboAnalystR 4.0 has significantly enhanced a series
   of important functions and implemented the much-requested support for
   MS2 spectra processing and compound identification in an efficient
   manner. In combination with its already well-established
   functionalities on statistical analysis and functional interpretation,
   MetaboAnalystR 4.0 offers a unified workflow to support LC-MS global
   metabolomics and exposomics studies.

Methods

   This research complies with all relevant ethical regulations. The whole
   blood exposomics study was approved by the Research Ethics Office of
   McGill University (Study ID: A05-M26-16B).

Chemicals

   Standard human serum and ammonium acetate (NH[4]AC) were purchased from
   Sigma-Aldrich (Sigma, St. Louis, MO, USA). Acetonitrile (ACN), methanol
   (MeOH), 0.1% formic acid (FA) in Water, 0.1% FA in ACN, and pure water
   were purchased from Fisher Chemical (Morris Plains, NJ, U.S.A.).

Blood sample preparation

   Healthy volunteers were recruited from McGill University as previously
   discussed^[166]57. About five milliliters of venous whole blood (WB)
   were drawn from each volunteer into a BD K2-EDTA Trace Element free
   Vacutainer. A sub-sample of this whole blood was used to obtain plasma
   (i.e., whole blood centrifuged for 15 min at 4 °C at 2700 rpm). From
   each individual, another blood sample was collected into a BD
   Vacutainer tube not containing any anticoagulant, which was allowed to
   sit for ~30–60 min for clots to form following which serum was obtained
   by centrifugation (15 min at 4 °C at 2700 rpm). Blood samples from 14
   individuals were collected and included in this study. All blood
   samples were paired with three different types (whole blood, serum, and
   plasma). All samples were immediately frozen at −80 °C until analysis.
   The demographic information of all subjects is summarized in
   Supplementary Table [167]8. This study was approved by the Research
   Ethics Office of McGill University (Study ID: A05-M26-16B).

   The different blood sample types were prepared based on the previously
   published protocols^[168]58,[169]59. The three blood sample types (WB,
   serum, and plasma) were thawed on ice for one hour, and then vortexed
   for 30 s to ensure homogeneity. One hundred µl of each sample type was
   transferred to a 1.5 ml Eppendorf microcentrifuge tube, to which 400 µl
   of −20 °C 1:1 ACN:MeOH (v/v) was added. Samples were vortexed for 60 s
   and stored at −20 °C for one hour. Samples were then centrifuged at
   16,100 × g for 10 min at 4 °C. The supernatants were collected (250 µl)
   and filtered by centrifugation using 0.2 µm Nanosep centrifugal filters
   (PALL Life Sciences) at 14,000 × g for 15 min at 4 °C. Filtered samples
   (120 µl) were then transferred to LC-MS vials equipped with 250 µl
   glass inserts and run in the LC-MS. A Quality Control (QC) sample was
   made by pooling equal volumes from each filtered samples supernatant
   into one 1.5 ml Eppendorf microcentrifuge tube.

Sample preparation for serial dilutions

   A urine sample was collected from a donor from the population mentioned
   above^[170]57. A total of 100 mL urine was sub-sampled and frozen at
   −80 °C. A human standard serum sample (Sigma-Aldrich, Sigma, St. Louis,
   MO, USA) and the urine were thawed on ice for one hour, and then
   vortexed for 30 s to ensure homogeneity. A total of 13 Eppendorf
   microcentrifuge tubes were prepared and labeled from A to M. For tubes
   A to E, 150 µl urine was transferred into each of them. For tubes G to
   K, 150 µl standard serum was transferred into each of them. 250 µl pure
   urine was transferred into tube L, and 250 µl pure serum was
   transferred into tube M. Then, 50 µl pure serum were extracted and
   mixed into tube E. Then, 50 µl liquids were extracted and mixed into
   tube D, and so on until tube A. Same operations were repeated for pure
   urine tube L, and tube G–K. Finally, 75 µl pure urine and serum were
   extracted respectively and mixed into tube F. As a result, a total of
   11 dilution mixtures and two pure samples were generated. The whole
   preparation workflow is shown in Supplementary Fig. [171]27. After the
   preparation of the serial dilutions, all samples were processed
   similarly as the blood samples, except that the ratio of organic
   reagents to samples is 1:2.5 instead of 1:4. All processed samples
   (120 µl) were then transferred to LC-MS vials equipped with 250 µl
   glass inserts and run in the LC-MS. No QC samples were prepared in this
   case study. Three replicates were prepared for each of the serial
   dilutions. This study has been approved by the Research Ethics Office
   of McGill University as described above.

LC-MS and MS2 analysis

   Metabolic profiling at the MS1 level was performed on a UHPLC system
   (Thermo Scientific™ UltiMate™ 3000 System). A hydrophobic column
   (Hypersil GOLD™ aQ C18 Polar Endcapped HPLC Column, 100 mm × 2.1 mm,
   1.9 μm) and a hydrophilic (Accucore™ 150 Amide HILIC HPLC Column,
   100 mm × 2.1 mm, 2.5 μm) column were used for reverse phase (C18
   column) and hydrophilic interaction liquid chromatography (HILIC
   column) separation, respectively. The chromatogram system was coupled
   to a Thermo Scientific Q-Exactive Orbitrap mass spectrometer.

   The chromatographic conditions for the C18 and HILIC columns were
   optimized as follows. For both columns, the flow rate was fixed at
   0.4 mL/min. For the C18 columns, the composition of the mobile phases A
   and B was 0.1% FA in water and 0.1% FA in ACN, respectively. For HILIC
   chromatography, the composition of the mobile phases A and B were 50%
   ACN in water with 5 mmol/L NH[4]AC and 95% ACN in Water with 5 mmol/L
   NH[4]AC, respectively. The gradient procedures and other instrumental
   parameters are provided in Supplementary Table [172]9.

   The Q-Exactive Orbitrap MS was configured as follows. For the C18
   column, an electrospray ion (ESI) source with a spray voltage of 4 keV
   in positive mode and 3.5 keV in negative mode were used, and for HILIC
   a voltage of 4 keV in positive mode and 3.8 keV in negative mode were
   used. Additional MS parameters were set for the C18 and HILIC columns,
   which are summarized in Supplementary Table [173]10. Both positive
   (ESI^+) and negative (ESI^−) ion modes were adopted for ion
   acquisition.

   MS2 was performed immediately after the LC-MS1 experiment with the
   corresponding mode (C18-ESI^+, C18-ESI^−, HILIC-ESI^+, or HILIC-ESI^−).
   The chromatographic conditions were the same as the ones detailed in
   the LC-MS section, while the mass spectrometer was specifically
   configured for untargeted DDA, DIA, and iterative targeted DDA,
   respectively. DIA was performed with sequential window acquisition of
   all theoretical fragment ion mass spectra (SWATH-MS) strategy. All
   parameters for the MS2 acquisitions are summarized in Supplementary
   Table [174]10.

   Untargeted DDA was performed to detect the top 10 ions with the highest
   intensities of each full MS scan. Immediately after the acquisition of
   MS1 and untargeted DDA, SWATH was performed. Each cycle of SWATH
   consisted of a full MS scan and 10 MS2 windows with different window
   sizes. The m/z size of the MS2 window was determined based on the
   general distribution of metabolic features at the MS1 level. The
   adjacent windows were sequentially used for DIA MS2 detection. The
   windows overlapped with their neighbors at 1.0 m/z. The design of SWATH
   windows is summarized in Supplementary Table [175]11. Approximately
   0.9 s elapsed in total for each SWATH cycle.

   The sample type-specific metabolic features were extracted based on the
   results of MS1 and used as the inclusion list. Iterative targeted DDA
   was executed with the updated inclusion list (optimized using
   HERMES^[176]60) to exhaust the targets as much as possible^[177]61. In
   detail, the sample-specific ions were input as the acquisition targets
   for DDA. Once the 1st round of DDA was finished, the detected ions were
   excluded from the target inclusion list. Then, the 2nd round of
   targeted DDA was performed with the updated inclusion list until the
   targets or samples were exhausted. All raw spectra data were deposited
   to the Metabolomics Workbench (accession ID: ST002796 and ST002798).

MS2 reference spectra library curation

   A total of nine public MS2 databases were collected and curated. HMDB
   databases were downloaded directly from its website as xml files. Four
   tables (HMDB_experimental_NegDB, HMDB_experimental_PosDB,
   HMDB_predicted_NegDB, HMDB_predicted_NegDB) were generated from the
   HMDB database. MoNA series and LipidBlast database were downloaded from
   MassBank of North America as msp files. Nine tables (MoNA_PosDB,
   MoNA_NegDB, ReSpect_PosDB, ReSpect_NegDB, VaniyaNP_PosDB, Vaniya_NegDB,
   BMDMS_PosDB, LipidBlast_PosDB and LipidBlast_NegDB) were generated.
   MassBank database was downloaded from its website as msp files. Four
   tables were generated (RIKEN_PosDB, RIKEN_NegDB, MassBank_PosDB,
   MassBank_NegDB). GNPS database was downloaded from the GNPS website as
   msp files. Two tables (GNPS_PosDB, GNPS_NegDB) were generated. MINEs
   database was downloaded from the MINEs website in msp format. Two
   tables were curated (MINEs_PosDB, MINEs_NegDB) from it. These
   downloaded files were curated with in-house R scripts into SQLite
   format based on the same schema, which is named “Complete library”.
   Detailed information regarding downloading links, database versions,
   and access dates are available in Supplementary Table [178]12. Based on
   the “Complete library”, we derived four specific MS2 reference
   libraries (Pathway library, Biology library, Lipid library, and
   Exposomics library) as described below.

   The pathway library was mainly curated according to the KEGG pathway
   information. KEGG pathways of 120 species (including common model
   species and pathogenic microorganisms) were downloaded with
   KEGGREST^[179]62. All compounds from the metabolic pathways of all
   species were extracted as a compound list. A total of 3456 compounds
   were included (supplementary Fig. [180]28). All MS2 records in the
   “Complete library” matching these compounds were extracted as a
   “Pathway library”. Similarly, Biology library was curated based on the
   compound information from KEGG^[181]17 compound database and
   HMDB^[182]31. All compounds and glycans from KEGG are summarized as
   compound list 1. All compounds in HMDB labeled as “Serum”, “Urine”,
   “Sweat”, “Saliva”, “Feces” and “Cerebrospinal Fluid” were summarized as
   compound list 2. Compound lists 1 and 2 were merged as a target list.
   All MS2 records in the “Complete library” matching these compounds were
   extracted as a “Biology library”. The lipid library was curated based
   on the compound information from LIPIDMAPS, LipidBank^[183]34,
   LipidBlast^[184]33, and HMDB database^[185]63. All lipids from these
   databases were summarized as a lipid list. All MS2 records in the
   “Complete library” matching these compounds were extracted as a “Lipid
   library”. All lipids in this library are classified into superclasses,
   main classes, and sub-classes based on RefMet^[186]64. Exposomics
   library was curated from KEGG Drug database^[187]65, Microbial
   Metabolites Database^[188]66, Toxin-Toxin-Target Database
   (T3DB)^[189]67, FooDB ([190]www.foodb.ca), Phenol-Explorer^[191]68,
   Exposome-Explorer^[192]69, and NORMAN Suspect List Exchange
   database^[193]70. All compounds from these databases were extracted as
   exposomics compound list. All MS2 records in the “Complete library”
   matching these compounds were separately extracted as an “Exposomics
   library”.

   Five neutral loss spectra databases have also been pre-calculated,
   corresponding to the five options above. The curation of neutral loss
   databases was based on the algorithm implemented by METLIN neutral loss
   database^[194]30. Briefly, neutral loss spectra were calculated by
   deducting the m/z from the precursor ion as the neutral loss ion. The
   intensity values were directly mirrored. The scheme of all MS2
   reference libraries is shown in Supplementary Fig. [195]29.

DDA data deconvolution

   The first step of DDA spectra deconvolution is to assign all MS2
   spectra to individual target features based on the precursor
   information. If users provide a target feature list, MetaboAnalystR can
   automatically perform MS2 data processing. Otherwise, the complete MS1
   features will be used for MS2 data processing. By default, the MS1
   features list generated by MetaboAnalystR includes minimum and maximum
   values for m/z and retention time. If m/z and retention time are not
   provided as in ranges, the m/z and retention time ranges will be
   calculated automatically based on tolerance values defined by users
   (ppm for m/z, and rt_tol for retention time). If there are multiple MS2
   spectra assigned to an individual target MS1 feature, the spectra would
   be merged in a weighted manner as developed in MZmine^[196]9. The
   median m/z (mz_med) and median retention time (rt_med) are extracted or
   calculated based on the MS feature’s information. Then the nearest MS1
   is extracted for the following analysis. If there are multiple
   different centroids within the (mz_med centered) isolation window, and
   any of their intensity values are over the acquisition threshold
   (user-defined), the spectrum is considered as “chimeric” to be
   deconvolved in the next steps. The centroid ion corresponding to the MS
   feature is considered as “main ion”, others are considered
   “contamination ions”.

   The purpose of deconvolution is to remove the fragments produced from
   “contamination ions” in the chimeric spectrum (“Spectrum 0”) and
   generate a clean deconvoluted spectrum for the “main ion”. Technically,
   the “main ion” is the ion of target feature, while the “contamination
   ion” may come from multiple sources, such as isobaric ions, orphan
   isotopologues^[197]18, other known or unknown ions with their m/z
   values falling into the isolation windows, etc. In the second step,
   MetaboAnalystR scans through all contamination ions and determines if
   they are orphan isotopologues. If any of them is identified as orphan
   isotopologues based on the MS1 scan, the spectrum of this orphan
   isotopologue is predicted with the method from DecoID^[198]18. We name
   the predicted spectrum of the orphan isotopologue as “Candidate
   Spectrum I”. Then, if there is any ion that has been detected and
   identified as a clean spectrum in one or two nearest MS2 scans inside
   the data itself, the clean spectrum is also extracted for
   deconvolution, named “Candidate Spectrum II”. Next, MetaboAnalystR
   extracts potential spectra from the MS2 library as the reference
   spectrum for the ions, which are neither orphan isotopologues nor the
   ones with clean spectra included by the spectra data itself. All
   spectra from the reference are extracted, and the one showing the
   highest similarity to the original chimeric spectrum (“Spectrum 0”) is
   retained as “Candidate Spectrum III”. The spectral similarity is
   evaluated with dot product^[199]18 or spectral entropy^[200]29
   similarity methods based on users’ preferences. If all ions in this