Abstract
Untargeted metabolomic analysis using mass spectrometry provides
comprehensive metabolic profiling, but its medical application faces
challenges of complex data processing, high inter-batch variability,
and unidentified metabolites. Here, we present DeepMSProfiler, an
explainable deep-learning-based method that enables end-to-end analysis of
raw metabolic signals with high accuracy and reliability.
Using 859 cross-hospital human serum samples from patients with lung
adenocarcinoma or benign lung nodules and from healthy individuals, DeepMSProfiler
successfully differentiates the metabolomic profiles of different
groups (AUC 0.99) and detects early-stage lung adenocarcinoma (accuracy
0.961). Model flow and ablation experiments demonstrate that
DeepMSProfiler overcomes inter-hospital variability and effects of
unknown metabolite signals. Our ensemble strategy removes
background-category phenomena in multi-classification deep-learning
models, and the novel interpretability enables direct access to
disease-related metabolite-protein networks. Further application to lipid
metabolomic data unveils correlations between important metabolites and
proteins. Overall, DeepMSProfiler offers a straightforward and reliable
method for disease diagnosis and mechanism discovery, with broad
applicability.
Subject terms: Machine learning, Metabolomics, Cancer metabolism
__________________________________________________________________
Untargeted metabolomic analysis provides comprehensive metabolic
profiling but faces challenges in medical application. Here, the
authors present an explainable deep learning method for end-to-end
analysis on raw metabolic signals to differentiate metabolomic profiles
of cancers with high accuracy.
Introduction
Metabolomics offers a comprehensive view of small molecule
concentrations within a biological system and plays a pivotal role in
the discovery of disease biomarkers for diagnostic purposes^[50]1.
Liquid chromatography mass spectrometry (LC-MS) is a widely practiced
experimental technique in global metabolomic studies^[51]2,[52]3. High
sensitivity, stability, reproducibility, and detection throughput are
the unique advantages of untargeted LC-MS^[53]4. Despite its capacity
to measure thousands of ion peaks, the conventional metabolomic study
by LC-MS remains a challenging task due to laborious data processing
such as peak picking and alignment, metabolite annotation by comparing
to authenticated databases, and data normalisation to control unwanted
variability in a large-scale study^[54]5,[55]6. The broader application
of metabolomics in precision medicine may be impeded by obstacles such
as complex data processing, high inter-batch variability, and
burdensome metabolite identification^[56]7.
Untargeted metabolomics has been conducted on various human biological
fluids, including serum and plasma, for the discovery of biomarkers in
cancers such as hepatocellular carcinoma^[57]8, pancreatic^[58]9,
prostate^[59]10, and lung cancers^[60]11–[61]13. However, such
biomarker discovery studies utilising metabolomics face significant
challenges regarding reproducibility, likely due to signal drifts in
cross-batch or cross-platform analysis^[62]14 and the limited
integration of data from different laboratory samples^[63]15.
Furthermore, unknown metabolites are excluded when comparing detected
features to authenticated databases^[64]16, which may hinder our
ability to discover new biomarkers associated with diseases. Several
previous studies have combined machine learning with LC-MS for in vitro
disease diagnosis and improved the efficiency of LC-MS data analysis.
For example, Huang et al. conducted machine learning to extract serum
metabolic patterns from laser desorption/ionisation mass spectrometry
to detect early-stage lung adenocarcinoma^[65]11; Chen et al. adopted
machine learning models to conduct targeted metabolomic data analysis
to identify non-invasive biomarkers for gastric cancer diagnosis^[66]17;
Shen et al. developed a deep-learning-based Pseudo-Mass Spectrometry
Imaging method and applied it in the prediction of gestational age of
pregnant women, as well as the diagnosis of endometrial cancer and
colon cancer^[67]18. However, these studies still face challenges such
as batch effects and unknown metabolites in metabolomics^[68]7.
Consequently, a new analytical approach is urgently needed to overcome
the experimental bottlenecks and reveal disease-associated profiles
comprising both identified and unknown components derived from LC-MS
peaks.
Deep learning has been widely applied in various omics data analyses,
holding promise for addressing the complexities of metabolomic
data^[69]19. The encoding and modelling capabilities of deep learning
offer a potential solution to overcome the aforementioned bottlenecks
in handling intricate and high-dimensional data, mitigating bias in
machine learning algorithms^[70]20,[71]21. However, deep learning
necessitates high-quality data and a sufficient quantity of samples;
otherwise, it suffers from issues such as the curse of dimensionality and
overfitting of predictive models^[72]22. Moreover, integrating large
datasets collected from multiple hospitals may introduce significant
variations. Furthermore, as deep learning methods are usually perceived
as “black-box” processes, the importance of model interpretability for
prediction in the context of biomedical research is increasingly
recognised^[73]23–[74]25. Therefore, a deep learning model with both
interpretability for biological soundness and capability to mitigate
batch effects is highly desirable to enhance the reliability of
large-scale metabolomic analyses for diagnostic purposes.
In this study, we develop an ensemble end-to-end deep learning method
named deep learning-based mass spectrum profiler (DeepMSProfiler)
for untargeted metabolomic data analysis. We firstly apply this method
to differentiate healthy individuals and patients with benign lung
nodules or lung adenocarcinoma using 859 serum samples from three
distinct hospitals, followed by its extended analysis on lipid
metabolomic data derived from 928 cell lines to reveal metabolites and
proteins associated with multiple cancer types. Without the peak
extraction and identification steps, and without the potential errors
introduced by conventional machine learning approaches, our method directly converts
raw LC-MS data into outputs such as predicted classification, heatmaps
illustrating key metabolite signals specific to each class, and
metabolic networks that influence the predicted classes. Importantly,
DeepMSProfiler effectively removes undesirable batch effects and
variations across different hospitals and infers the unannotated
metabolites associated with specific classifications. Furthermore, it
leverages an ensemble-model strategy that optimises feature attribution
from multiple individual models. DeepMSProfiler achieved an area under
the receiver operating characteristic curve (AUC) score of 0.99 in an
independent testing dataset, along with an accuracy of 96.1% in
detecting early-stage lung adenocarcinoma. The results are explainable
by locating the relevant biological components that contribute to the
prediction. Our method provides a straightforward and reliable
approach for metabolomic applications in disease diagnosis and
mechanism discovery.
Results
The overview of the ensemble end-to-end deep-learning model
The DeepMSProfiler method includes three main components: the
serum-based mass spectrometry, the ensemble end-to-end model, and the
disease-related LC-MS profiles (Fig. [75]1a). In the first component,
the raw LC-MS-based metabolomic data was generated using 859 human
serum samples (Fig. [76]1a left) collected from 210 healthy
individuals, 323 patients with benign lung nodules, and 326 patients with
lung adenocarcinoma. The
space of the LC-MS raw data contains three dimensions: retention time
(RT), mass-to-charge ratio (m/z), and intensity. Using the RT and m/z
dimensions, the data can be mapped from three-dimensional space into
the time and frequency domains, respectively (Fig. [77]1b left to
middle). Ion current maps and primary mass spectra can then be
generated and used for metabolite identification (Fig. [78]1b middle to
right). Conventional step-by-step methods of metabolomic
analysis^[79]5,[80]22 (Supplementary Fig. [81]1 top) may lead to the
loss of a large number of metabolic signals. To address these issues,
DeepMSProfiler directly takes untargeted LC-MS raw data as model input,
and builds an end-to-end deep learning model to profile disease-related
metabolic signals (Supplementary Fig. [82]1 bottom).
Fig. 1. The DeepMSProfiler method using LC-MS-based untargeted serum
metabolome.
a The overview of DeepMSProfiler. Serum samples of different
populations (top left) were collected and sent to the instrument
(bottom left) for liquid chromatography-mass spectrometry (LC-MS)
analysis. The raw LC-MS data, containing information on retention time
(RT), mass-to-charge ratio (m/z), and intensity, is used as input to
the ensemble model (middle). Multiple single convolutional neural
networks form the ensemble model (centre) to predict the true label of
the input data and generate three outputs (right), including the
predicted sample classes, the contribution heatmaps of
classification-specific metabolic signals, and the
classification-specific metabolic networks. b The data structure of raw
data. The mass spectra of different colours (centre) represent the
corresponding m/z and ion intensity of ion signal groups recorded at
different RT frames. All sample points are distributed in a
three-dimensional space (left) which can be mapped along three axes to
obtain chromatograms, mass spectra, and two-dimensional matrix data.
Chromatograms and mass spectra are used for conventional qualitative
and quantitative analysis (right), while the two-dimensional matrix
serves as input data for convolutional neural networks. c The structure
of a single end-to-end model. The input data undergoes the pre-pooling
processing to reduce dimensionality and become three-channel data. As
the model passes through each convolutional layer (conv) in the feature
extractor module, the weights associated with the original signals
change continuously. The sizes of different frames in the enlarged
layers (top) represent different receptive fields, with DenseNet
allowing the model to generate more flexible receptive field sizes.
After the last fully connected layer (FC), the classifications are
resulted.
The main model adopts an ensemble strategy and consists of multiple
sub-models (Fig. [85]1a middle). The ensemble strategy is considered to
provide better generalisation^[86]26, owing to the complexity of the
hypothesis space^[87]27, the local optima reached during search^[88]27, the
parameter diversity introduced by the random training process^[89]28,[90]29,
and bias-variance theory^[91]30–[92]32.
The structure of each sub-model consists of three parts: a pre-pooling
module, a feature extraction module, and a classification module
(Fig. [93]1c). The pre-pooling module transforms three-dimensional data
into two-dimensional space through a max-pool layer, effectively
reducing dimensionality and redundancy while preserving global signals
(Fig. [94]1c left). The feature extraction module is based on a
convolutional neural network to perform classification tasks by
extracting category-related features (Fig. [95]1c middle). DenseNet121
is chosen as the backbone of the feature extraction module because it
achieves the highest accuracy with the fewest parameters (Supplementary
Data [96]1). Additionally, the design of densely connected
convolutional networks in DenseNet121^[97]33 makes the model more
flexible in terms of receptive field size, allowing adaptation to
different RT intervals of metabolic signal peaks. After the non-linear
transformation by the feature extraction layers, different weights are
assigned to the original input signal peaks for subsequent
classification. We evaluated the effect of the number of sub-models on
improving performance and consequently chose to use 18 sub-models for
DeepMSProfiler (Supplementary Fig. [98]2). The classification module
implements a simple dense neural network to compute the probabilities
of different classes (Fig. [99]1c right).
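For illustration, the following is a minimal sketch of one such sub-model in TensorFlow/Keras; the input size, pooling factor, channel replication, and training settings are assumptions for the sketch rather than the exact published configuration.

```python
# A hedged sketch of a single DeepMSProfiler-style sub-model (assumed hyperparameters).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sub_model(input_size=1024, n_classes=3):
    inputs = layers.Input(shape=(input_size, input_size, 1))
    # Pre-pooling module: max-pooling reduces resolution while preserving strong peaks.
    x = layers.MaxPooling2D(pool_size=4)(inputs)
    # Replicate the single intensity channel to three channels for the DenseNet backbone.
    x = layers.Concatenate()([x, x, x])
    # Feature extraction module: DenseNet121 backbone (ImageNet weights could also be used).
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights=None,
        input_shape=(input_size // 4, input_size // 4, 3))
    x = layers.GlobalAveragePooling2D()(backbone(x))
    # Classification module: a single dense layer with softmax over the classes.
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```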
In the third component, each input sample results in three outputs from
the model, including the predicted classification of the sample, the
heatmap for the locations of the key metabolic signals, and the
metabolic network that influences the predicted category (Fig. [100]1a
right). In the predicted classification, the category with the highest
probability is assigned as the predicted label of the model. In the
heatmap presentation, the key metabolic signals associated with
different classifications are inferred by the perturbation-based
method, and the m/z and RT of the key metabolic signals can be located.
Finally, the model infers the underlying metabolite-protein networks
and metabolic pathways associated with these key metabolic signals
directly from m/z (see Methods).
Model performance metrics
To test DeepMSProfiler’s ability in classifying disease states and
discovering disease-related profiles, we performed the global
metabolomic analysis of serum samples from healthy individuals and
patients with benign lung nodules or lung cancer. We collected serum
samples from three different hospitals to construct and validate the
model. Benign nodule cases were followed for up to 4 years and lung
adenocarcinoma samples were pathologically examined (see Methods). We
built the model using 859 untargeted LC-MS samples, of which 686 served as
the discovery dataset and 173 as the independent testing dataset
(Fig. [101]2a). The samples were generated from 10 batches, as shown in
Supplementary Table [102]1. Statistical analysis shows a significant
correlation between lesion size and disease type, but the correlations
with other clinical factors are not significant (Supplementary
Table [103]2). To avoid confounding effects of clinical features, we
performed a further analysis of the distributions, which shows no
significant difference in the distributions of lesion diameter and
patient age between the discovery dataset and the independent testing
dataset (Supplementary Fig. [104]3). The samples in the discovery
dataset were randomly divided into two subsets: a training dataset
(80%) for parameter optimisation and a validation dataset (20%) to
cross-validate the performance of different models. Remarkably, the
accuracies of our DeepMSProfiler in the training dataset, validation
dataset, and independent testing dataset are 1.0, 0.92, and 0.91,
respectively (Supplementary Table [105]3).
Fig. 2. The prediction performance of DeepMSProfiler.
a The sample allocation chart. The outer ring indicates the types of
diseases and the inner ring indicates the sex distribution. Healthy:
healthy individuals; Benign: benign lung nodules; Malignant: lung
adenocarcinoma. b Predicted receiver operating characteristic (ROC)
curves of different methods. Random: performance baseline in a random
state. Comparison of performance metrics of different methods (n = 50):
accuracy (c), precision (d), recall (e), and F1 score (f). The blue
areas show the different conventional analysis processes using machine
learning methods, and the red areas display different end-to-end
analysis processes using deep learning methods. The boxplot shows the
minimum, first quartile, median, third quartile, and maximum values, with
outliers shown as individual points. g Model accuracy rates for different age
groups. The sample sizes for different groups are 52, 69, 40, and 12,
respectively. h Model accuracy rates for different lesion diameter
groups. The sample sizes for different groups are 27, 37, 18, 13, and
34, respectively. The boxplot shows the minimum, first quartile,
median, third quartile and maximum values. i Prediction accuracy and
parameter scale of different model architectures. j The confusion
matrix of the DeepMSProfiler model. The numbers inside the boxes are
the number of matched samples between the true label and the predicted
label. The ratio in parentheses is the number of matched samples
divided by the number of all samples of the true label.
In the independent testing dataset, DeepMSProfiler significantly
outperforms traditional methods and single deep learning models
(Fig. [108]2b–h). Compared with Support Vector Machine (SVM), Random
Forest (RF), Deep Neural Network (DNN), Adaptive Boosting
(AdaBoost), and Extreme Gradient Boosting (XGBoost) based on
traditional methods, and Densely Connected Convolutional Networks
(DenseNet121) using raw data, DeepMSProfiler presents the highest area
under the curve (AUC) of 0.99 (Fig. [109]2b). Notably, DeepMSProfiler
exhibits higher specificity than other models while maintaining high
sensitivity, indicating its ability to accurately identify true
negatives (Supplementary Table [110]4).
Regarding overall performance against other models, our model achieves
the best performance in multiple evaluation metrics: accuracy of 95%
(95% CI, 94%–97%) (Fig. [111]2c), precision of 96% (95% CI, 94%–97%)
(Fig. [112]2d), recall of 95% (95% CI, 94%–96%) (Fig. [113]2e), and F1
of 98% (95% CI, 97%–98%) (Fig. [114]2f). Compared to XGBoost, our model
performs better in different groups of lesion sizes and ages
(Fig. [115]2g, h), except for samples from patients over 70 years old.
DeepMSProfiler is also superior to commonly used single deep learning
models, such as DenseNet121 (Fig. [116]2i). When using the ensemble
strategy, we did not set different weights for each sub-model, so each
sub-model is equally involved in the final prediction. The confusion
matrices of prediction performance for each sub-model (Supplementary
Fig. [117]4) show their contributions to the overall results. The
robust performance, coupled with the efficiency in terms of
computational resources, makes DeepMSProfiler a promising choice for
the classification tasks (Fig. [118]2i).
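Because the sub-models are equally weighted, the ensemble prediction reduces to averaging their class probabilities; a minimal sketch, assuming `sub_models` is a list of trained Keras sub-models and `x` a batch of preprocessed input matrices:

```python
import numpy as np

def ensemble_predict(sub_models, x):
    """Average softmax outputs of equally weighted sub-models; return probabilities and labels."""
    # Each sub-model yields an (n_samples, n_classes) probability matrix.
    probs = np.stack([m.predict(x, verbose=0) for m in sub_models], axis=0)
    mean_probs = probs.mean(axis=0)                # equal weights across sub-models
    return mean_probs, mean_probs.argmax(axis=1)   # class with the highest mean probability
```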
Furthermore, DeepMSProfiler demonstrates consistent performance across
different categories. All of the AUCs for lung adenocarcinoma, benign
lung nodules, and healthy individuals reach 0.99 (Supplementary
Fig. [119]5), and their respective classification accuracies are 85.7%,
90.8%, and 97.0% (Fig. [120]2j). Most importantly, our model has good
performance for detecting stage-I lung adenocarcinoma with an
accuracy of 96.1%, indicating its potential as an effective method for
early lung cancer screening.
Insensitivity to batch effects
Batch effect is one of the most common error sources for the analysis
of metabolomic data. To evaluate the impact of batch effects on the
non-targeted LC-MS data, we first generated 3 biological replicates as
reference samples for each of the 10 batches. These reference
replicates were taken from a mixture of 100 healthy human serum
samples, and each of them contains equal amounts of isotope-labelled
internal standards, including ^13C-lactate, ^13C[3]-pyruvate,
^13C-methionine, and ^13C[6]-isoleucine
(see Methods). The differences in the data structure of reference
replicates from different batches are visualised in 3D and 2D
illustrations (Fig. [121]3a and Supplementary Fig. [122]6), which
indicate changes in peak shape and area as well as RT shifts among
different batches. Comparison of individual isotopic peaks in samples
from different batches also shows that the batch effects mainly take
the form of RT shifts and differences in peak shape and area for
the same metabolites (Fig. [123]3b).
Fig. 3. Explainable deep learning method avoids limitations of batch effects
in conventional methods.
a Batch effects in 3D point array and 2D mapped heatmap of reference
samples. RT: retention time; m/z: mass-to-charge ratio. b Isotope peaks
of the same concentration in different samples. Different colours
represent the batches to which the samples belong. c The visualisation
of dimensionality reduction of normalisation by the Reference Material
method. Below: different colours represent different classes; Above:
different colours indicate different batches. Healthy: healthy
individuals; Benign: benign lung nodules; Malignant: lung
adenocarcinoma. d The visualisation of dimensionality reduction for the
output data of the hidden layers in DeepMSProfiler. Conv1 to Conv5 are
the outputs of the first to the fifth pooling layer in the feature
extraction module. Block4 and Block5 are the outputs of the fourth and
fifth conv layers in the fifth feature extraction module. Upper:
different colours indicate different sample batches; Lower: different
colours represent different population classes. e Correlation of the
output data of the hidden layer with the batch and class information in
DeepMSProfiler. The horizontal axis represents the layer names. Conv1
to Conv5 are the outputs of the first to the fifth pooling layer in the
feature extraction module. Block10 and Block16 are the outputs of the
tenth and sixteenth conv layers in the fifth feature extraction module.
The blue line represents the batch-related correlations, and the orange
line illustrates the classification-related correlations. f The
accuracy rates of traditional methods (blue), corrected methods based
on reference samples (purple), and DeepMSProfiler (red) in independent
testing dataset (n = 50). The boxplot shows the minimum, first
quartile, median, third quartile, and maximum values, with outliers
shown as individual points.
We then investigated the batch effect corrections and compared the
performance between DeepMSProfiler and conventional correction methods
(see Methods). As shown in Fig. [126]3c, after correction by the
Reference Material (Ref-M) method^[127]34, we still observed 3 clusters
in the principal component analysis (PCA) profiles, which represent the
sample dots from three different hospitals. While Ref-M effectively
addresses the batch effect within samples from the same hospital, the
residual variation across hospitals remains (Fig. [128]3c). Samples of
batches 1–6 and 9–10 were obtained from the Sun Yat-Sen University Cancer
Centre, and samples of batch 8 came from the First Affiliated Hospital
of Sun Yat-Sen University. Samples of batch 7 were a mixture from three
different hospitals. Among them, lung cancer samples came from the
Affiliated Cancer Hospital of Zhengzhou University, lung nodule samples
came from the First Affiliated Hospital of Sun Yat-Sen University, and
healthy samples came from the Sun Yat-Sen University Cancer Centre. The
100 healthy human serum samples used as references during the conventional
procedures were all from the Sun Yat-Sen University Cancer Centre,
which might be the main reason why it is difficult to correct the batch
effects for batches 7–9. To illustrate DeepMSProfiler’s end-to-end
process in the automatic removal of batch effects, we extracted the
output of the hidden layer and visualised the flow of data during the
forward propagation of the network (see Methods). From the input layer
to the output layer, the similarity between different batches becomes
progressively higher, while the similarity between different types
becomes progressively lower (Fig. [129]3d and Supplementary
Fig. [130]7). DenseNet121 is a deep neural network with 431 layers, of
which 120 are convolutional layers (Supplementary Data [131]1). The
fourth and fifth layers here refer to the outputs of the fourth and
fifth densely connected modules, which are separated by 112 layers.
Figure [132]3d and Supplementary Fig. [133]7 illustrate the intermediate
changes from the output of the fourth densely connected module to the
output of the fifth densely connected module. As the batch effects are
removed, the classification becomes clearer. We further quantified this process of
change using different metrics that measure the correlation between the
PCA clusters and the given labels, i.e., the K-nearest neighbour batch
effect test score^[134]35, local inverse Simpson’s index^[135]36,
adjusted Rand index (ARI), normalised mutual information (NMI), and
average silhouette coefficient (ASC) (see Methods). The closer to the
output layer, the less relevant the data is to batch labels and the
more relevant to class labels (Fig. [136]3e). This explains how
batch effect removal is achieved progressively through the hidden layers
(Fig. [137]3d). This capability might be gained through supervised
learning. Our findings suggest that in the forward propagation process,
the DeepMSProfiler model excludes batch-related information from the
input data layer by layer, while retaining class-related information.
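A hedged sketch of how such layer-wise correlations could be quantified with scikit-learn, assuming `layer_outputs` is a flattened (n_samples, n_features) activation matrix extracted from one hidden layer and `labels` holds either batch or class annotations; here ARI and NMI compare k-means clusters of a PCA embedding against the labels, and the silhouette coefficient is computed on the labels directly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score

def label_correlation(layer_outputs, labels, n_components=2):
    """Score how strongly a hidden layer's activations are organised by the given labels."""
    emb = PCA(n_components=n_components).fit_transform(layer_outputs)
    clusters = KMeans(n_clusters=len(np.unique(labels)),
                      n_init=10, random_state=0).fit_predict(emb)
    return {"ARI": adjusted_rand_score(labels, clusters),
            "NMI": normalized_mutual_info_score(labels, clusters),
            "ASC": silhouette_score(emb, labels)}

# For each hidden layer, batch-label correlation is expected to fall and
# class-label correlation to rise as the data approach the output layer:
# batch_scores = label_correlation(layer_outputs, batch_labels)
# class_scores = label_correlation(layer_outputs, class_labels)
```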
Further, we compared the performance of our deep learning method
against machine learning methods with and without batch effect
correction. The correction of batch effects could improve accuracies
when using machine learning classifiers such as SVM, RF, AdaBoost, and
XGBoost. However, DeepMSProfiler, without any additional manipulation,
surpassed the machine learning methods with or without batch effect
correction in terms of prediction accuracy (Fig. [138]3f).
The impact of unknown mass spectrometric signals
To investigate the impact of unknown metabolite signals on
classification prediction, we first performed the conventional analysis
with peak extraction and metabolite annotation using existing databases
such as Human Metabolome Database (HMDB) and Kyoto Encyclopedia of
Genes and Genomes (KEGG) (see Methods). We found that 83.5% of all
detected features remain as unknown metabolites (Fig. [139]4a). The
absence of these unknown metabolites undermines the prediction accuracy
(Fig. [140]4b), indicating that this large number of unknown metabolites may
have a significant impact on classification performance. One of the
advantages of our approach over traditional methods is the ability to
retain complete metabolomic features including the unknown metabolites.
Fig. 4. The unknown space in databases and previous studies.
a Statistics of annotated metabolite peaks. Blue colour represents all
peaks, while orange, purple, and yellow colours indicate metabolites
annotated in HMDB, KEGG, and all databases, respectively. The overlap
between orange and purple includes 414 metabolites annotated in both
HMDB and KEGG. HMDB: Human Metabolome Database; KEGG: Kyoto
Encyclopedia of Genes and Genomes. b The feature selection plot
illustrates the effect of different contribution score thresholds
when unknown metabolites are removed versus retained. The horizontal axis
represents the change in threshold, while the vertical axis shows the
accuracy of the model using the remaining features. The shadings of
solid lines (mean) represent error bars (standard deviation). c
Collection criteria for published lung cancer serum metabolic biomarkers.
SCLC: Small Cell Lung Cancer; LUSC: Lung Squamous Cell Carcinoma; LUAD:
Lung Adenocarcinoma; NMR: Nuclear Magnetic Resonance; MS: Mass
Spectrometry. d The numbers of known biomarkers published in the
current literature. e Molecular weight distribution plot of known
biomarkers. f Accuracy comparison between the ablation experiment and
DeepMSProfiler (n = 50). The boxplot shows the minimum, first quartile,
median, third quartile, and maximum values. In the ablation experiment,
we investigated the effect of varying the publication count (PC) of
known biomarkers in the literature. Specifically, based on the m/z of
known biomarkers, we eliminated metabolic signals from the raw data that
were not reported in the literature. We retained only the metabolic signals
with publication counts greater than 1, greater than 3, and greater
than 8 for model development. All ablated data were analysed using
the same DeepMSProfiler architecture as the original unprocessed
data. The vertical axis shows the accuracy of
models built on the dataset of different publication counts and our
DeepMSProfiler.
We then tested the limitation of lung cancer prediction using
biomarkers identified by annotated metabolites. We collected 826
biomarkers for lung cancer based on serum mass-spectrometry analysis
from 49 publications (Fig. [143]4c). We deduced the molecular weight of
the biomarkers from the HMDB database and the information in these
publications (see Methods). Only 42.7% of the biomarkers discovered
based on traditional methods appear in more than two articles, indicating
suboptimal reproducibility (Fig. [144]4d). In addition, the
molecular weights of these metabolites are mainly distributed in the
range of 200–400 Da (Fig. [145]4e). We found that the prediction
performance of DeepMSProfiler using complete raw data is substantially
higher than that obtained using only the m/z signals corresponding to
the reported biomarkers (Fig. [146]4f). This indicates that there
are still unknown lung cancer-related metabolomic signals in the serum
samples that have not been unveiled in the current research. In
contrast, DeepMSProfiler derives the classifications directly from the
complete signals in the raw data.
Explainability to uncover the black box
After the model construction based on deep learning, we sought to
explain the classification basis of the black box model and identify
the key signals for specific classifications. We adopted a
perturbation-based method, Randomised Input Sampling for Explanation
(RISE)^[147]37, to compute feature contributions. We slightly modified
RISE to improve its operational speed and efficiency, and developed a
method to evaluate the importance of RISE scores in different
classifications (see Methods).
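A minimal sketch of this RISE-style scoring, assuming `model` returns class probabilities for a preprocessed square input matrix; the mask resolution, number of masks, and keep probability are illustrative values, and masks are applied one at a time for clarity rather than speed.

```python
import numpy as np

def rise_scores(model, x, n_masks=2000, grid=16, p_keep=0.5, seed=0):
    """Perturbation-based contribution map: weight random masks by the predicted probabilities."""
    rng = np.random.default_rng(seed)
    h, w = x.shape[:2]
    saliency = np.zeros((h, w, model.output_shape[-1]), dtype=np.float32)
    for _ in range(n_masks):
        # Low-resolution random binary mask, upsampled to the input size by repetition.
        coarse = (rng.random((grid, grid)) < p_keep).astype(np.float32)
        mask = np.kron(coarse, np.ones((h // grid, w // grid), dtype=np.float32))
        probs = model.predict(x[None, ..., None] * mask[None, ..., None], verbose=0)[0]
        saliency += mask[..., None] * probs[None, None, :]   # one score map per class
    return saliency / (n_masks * p_keep)   # (H, W, n_classes) contribution scores
```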
Interestingly, we found a “background category” phenomenon in some of
the single models (Fig. [148]5a). For each class in a single
tri-classification model of DenseNet121, the classification performance
gradually deteriorates as the metabolic signals with higher
contribution scores are removed. However, there is one category that
remains unaffected regardless of which features involved in the classification
decision are removed, while maintaining a very high number of true positives and
false positives. These findings imply that the tri-classification model
only predicts the probabilities of two of the categories, and
calculates the probability of the third category from the results of
the other two. In other words, the classification-related metabolites
associated with the “foreground category” negatively contribute to the
“background category”. Furthermore, the categories used as “background
category” are not consistent across the different models. We were
intrigued to test whether this phenomenon occurs exclusively in
metabolomic data, so we conducted a seven-classification task using the
Photo-Art-Cartoon-Sketch (PACS) image dataset^[149]38. We observed a
similar phenomenon in the resultant feature scoring of different models
(Supplementary Fig. [150]8). This suggests that a “background category”
may generally exist in multi-classification tasks performed by single models,
although its underlying mechanism is currently unclear and may require
future investigation.
Fig. 5. Feature constructions of prediction and significant metabolic signals
relevant to biological pathways.
a Prediction performance and feature scoring by different single
models. b Prediction performance and feature scoring by DeepMSProfiler.
c Heatmap matrices of classification contribution in healthy
individuals (Healthy), benign lung nodules (Benign), and lung
adenocarcinoma (Malignant). The horizontal and vertical axes of the
matrix are the prediction label and the true label, respectively. The
heatmaps of upper left, the middle one, and the bottom right represent
the true healthy individuals, the true benign nodules, and the true
lung adenocarcinoma, respectively. The horizontal and vertical axes of
each heatmap are RT and m/z, respectively. The classification
contributions of metabolites corresponding to true healthy individuals
(d), benign nodules (e), and lung adenocarcinoma (f). The horizontal
axis represents the retention time and the vertical axis represents the
m/z of the corresponding metabolites. The colours represent the
contribution score of the metabolites. The redder the colour, the
greater the contribution to the classification. Metabolites-proteins
network for healthy individuals (g), benign nodules (h), and lung
adenocarcinoma (i). Pathway enrichment analysis using the signalling
networks for healthy individuals (j), benign nodules (k), and lung
adenocarcinoma (l) (FDR < 0.05). m/z: mass charge ratio, RT: retention
time.
In contrast, the phenomenon of “background category” no longer exists
when the feature contributions are calculated by our ensemble model. As
shown in Fig. [153]5b, when we progressively eliminated metabolic
signals in each category according to their contribution, the
performance of the ensemble model decreased accordingly. Each
individual model captures different features that contribute to their
corresponding classifications, while the ensemble model could combine
these features to improve the accuracy of disease prediction and reduce
the possibility of overfitting.
Metabolomic profiles in lung adenocarcinoma, benign nodules, and healthy
individuals
To analyse the global metabolic differences between lung
adenocarcinoma, benign nodules, and healthy individuals, we extracted
the heatmaps of feature contributions counted by RISE from
DeepMSProfiler (Fig. [154]5c). As shown in Fig. [155]5d–f, the
horizontal and vertical axes of the heatmaps represent retention time and
m/z, respectively. By mapping the label information to the
heatmaps, we are able to locate the metabolites corresponding to
different m/z and retention times to obtain their feature contribution
scores. In true-positive healthy and benign nodule samples, the
metabolic signals with the most significant contribution are uniformly
located between 200 and 400 m/z and in 1–3 min (Fig. [156]5d, e). In
comparison, the metabolic signals located between 200 and 600 m/z and
in 1–4 min contribute most in lung adenocarcinoma samples, but signals
in other regions also have relatively high scores (Fig. [157]5f).
As higher contribution scores in the heatmaps represent more important
correlations, we screened signals with scores above 0.70 and attempted
to identify the corresponding metabolic profiles in each classification
(Supplementary Data [158]2). As observed in Fig. [159]5b, retaining only
metabolic signals with a contribution score above 0.7 keeps the overall
accuracy at around 0.8, which maintains effective
classification. Considering the RT shift among different
batches, we matched metabolic peaks only by m/z. We then fed these m/z
signals, together with metabolites identified by tandem mass
spectrometry (MS2), into the analysis tool PIUMet based on
protein–protein and protein-metabolite interactions^[160]39 to build
disease-associated feature networks (Fig. [161]5g–i). As the network
shown in Fig. [162]5i, 82 proteins and 121 metabolites are matched in
the lung adenocarcinoma samples, including 9 already identified by MS2
and 111 hidden metabolites found by the correlation between key
metabolic peaks. As such, the current analysis based on protein-protein
and protein-metabolite interactions allows the discovery of unknown
metabolic signals associated with diseased states, although the
resolution of the current model might be relatively low in
distinguishing all individual peaks contributing to the disease
classification. To explore the biological explainability,
among the features extracted by PIUMet, we also selected 11 metabolites
(Supplementary Table [163]5) with available authentic standards in our
laboratory to verify their presence in the lung cancer serum samples.
Indeed, these metabolites could be identified in the lung cancer serum
as described in our previous study^[164]40. We further analysed the
metabolic networks to explore the biological relevance associated with
each classification. The heatmaps (Fig. [165]5d, e) and pathway
analysis (Fig. [166]5j, k) consistently show that healthy individuals
and benign nodules share similar metabolic profiles. In contrast, the
cancer group presents a distinct profile with specific pathways and
increased counts of metabolites or proteins in the pathways shared
with healthy individuals or benign nodules (Fig. [167]5f, l). The
detailed metabolite and protein candidates are further shown in
Supplementary Data [168]3. Taken together, our network and pathway
analyses demonstrated the interpretability of DeepMSProfiler based on
deep learning.
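A simple sketch of the screening step described above, assuming `heatmap` is a contribution matrix whose rows follow the RT grid and columns follow the m/z grid used for the input matrices; the 0.70 threshold matches the text, while the axis grids and orientation are assumptions.

```python
import numpy as np

def select_key_signals(heatmap, threshold=0.70,
                       rt_start=0.5, rt_step=0.016, mz_start=50.0, mz_step=1.0):
    """Locate signals whose contribution score exceeds the threshold and return their RT and m/z."""
    rt_idx, mz_idx = np.where(heatmap > threshold)
    rt = rt_start + rt_idx * rt_step
    mz = mz_start + mz_idx * mz_step
    # Because RT shifts across batches, downstream matching (e.g. input to PIUMet)
    # relies on the m/z values only.
    return rt, mz, np.unique(mz)
```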
Application of model in colon cancer
To assess the transferability of DeepMSProfiler, we obtained a public
colon cancer LC-MS dataset containing 236 samples from
MetaboLights (ID: MTBLS1129), comprising 197 colon cancer samples and
39 healthy human samples. We randomly divided this
dataset into a discovery dataset and an independent testing dataset at a 4:1
ratio. The discovery dataset contained 157 colon cancer samples and 31
healthy control samples. The independent testing dataset contained 40
colon cancer samples and 8 healthy control samples.
Due to the differences in cancer types and mass spectrometry analysis
procedures between the colon cancer dataset and the lung adenocarcinoma
dataset, we re-trained the DeepMSProfiler model. The colorectal cancer
data was randomly divided into a discovery dataset and an independent
testing dataset, and the discovery dataset was further randomly divided
into a training dataset and a validation dataset multiple times.
In the independent testing dataset of the colon cancer dataset, our
model achieved an accuracy of 97.9% (95% CI, 97.7%–98.1%), a precision
of 98.7% (95% CI, 98.6%–98.8%), a recall of 93.4% (95% CI,
92.9%–94.1%), and an F1 of 95.8% (95% CI, 95.4%–96.2%) (Supplementary
Fig. [169]9). These results suggest an excellent transferability of
DeepMSProfiler.
Discovery of metabolic-protein networks in pan-cancer
In a continued effort to investigate the capabilities of DeepMSProfiler
in analysing metabolomics data across multiple cancer types, raw lipid
metabolomic data of 928 cell lines spanning 23 cancer types were
collected from the Cancer Cell Line Encyclopaedia (CCLE)
database^[170]2 and then subjected to processing by DeepMSProfiler.
Notably, in addition to the raw metabolomic data, these cell lines also
contain valuable data of annotated metabolites, methylation, copy
number variations, and mutations^[171]2. DeepMSProfiler constructed a
model encompassing the 23 distinct categories, followed by a feature
extraction from the 23-category model to identify the respective
crucial metabolic signals of each category. Due to the limited number
of samples for many cancer types, particularly for biliary tract,
pleura, prostate, and salivary gland cancers, each with fewer than 10
samples, we did not set aside a separate independent testing dataset for
performance validation. Twenty sub-models were trained, and in each
sub-model training run, 80% of all samples were randomly allocated for
training to ensure that every sample could contribute to the training
process, especially for cancer types with very few samples. The final
ensemble model used for explainable analysis achieved 99.3% accuracy,
97.2% sensitivity, and 100% specificity. Next, the priority-collecting
Steiner forest optimisation algorithm^[172]39 was employed to unveil
the correlation between pivotal metabolic signals and proteins using
databases of HMDB^[173]41, Recon2^[174]42 and iRefIndex^[175]43 (see
Methods).
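A minimal sketch of that sampling scheme, assuming sub-model subsets are drawn independently so that, with high probability, every sample enters the training set of at least one sub-model; `build_sub_model` refers to the hypothetical constructor sketched earlier, and the epoch count is illustrative.

```python
import numpy as np

def train_sampled_ensemble(X, y, build_sub_model, n_models=20, frac=0.8, epochs=50):
    """Train each sub-model on an independent random 80% subset of all samples."""
    rng = np.random.default_rng(0)
    n = len(X)
    sub_models = []
    for _ in range(n_models):
        idx = rng.choice(n, size=int(frac * n), replace=False)  # random 80% without replacement
        model = build_sub_model()
        model.fit(X[idx], y[idx], epochs=epochs, verbose=0)
        sub_models.append(model)
    return sub_models
```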
As a result, we successfully generated disease-specific
metabolite-protein networks (Fig. [176]6a–c) along with a contribution
score heatmap (Fig. [177]6e), where contribution scores exceeding 0.70
were considered indicative of disease-specific metabolites. Metabolites
identified within the metabolite-protein network were directly inferred
from the mass-to-charge ratio (m/z) of metabolic signals from the raw
data using feature spectra extracted by the DeepMSProfiler model.
Notably, we identified 14 metabolites and 3 proteins that exhibited
co-occurrence within the 23 cancer-related metabolite-protein networks
(Fig. [178]6d). Finally, we correlated the metabolic data with the
methylation information and subsequently verified the associations
between the PLA and UGT gene families and the disease-specific
metabolites with high contributions (Fig. [179]6f). Previous
studies^[180]44–[181]48 have reported the important roles of PLA and
UGT gene families in a variety of diseases, such as PLA2G7 and PLA2G6
in breast and prostate cancers and neurodegenerative diseases, as well
as UGT3A2 in head and neck cancers. These evidences support our
findings by DeepMSProfiler. In summary, our extended analysis spanning
pan-cancer scenarios highlights the capability of DeepMSProfiler in the
discovery of potential disease-associated metabolites and proteins.
Fig. 6. Metabolite and protein associations across 23 cancer types.
Metabolite-protein networks for (a) lung cancer, (b) gastric cancer,
and (c) leukaemia. Yellow squares: metabolites. Red circles: proteins.
Blue labels: metabolites and proteins shared in 23 cancer
metabolite-protein networks. d Metabolites and proteins shared in the
metabolite-protein networks of 23 cancer types. e Heatmap of the
classification contribution of different lipid metabolites across 23
cancer types. f Correlation of important pan-cancer-related metabolites
with methylation of the PLA and UGT gene families.
Discussion
Metabolomics faces challenges in precision medicine due to complex
analytical processes, metabolic diversity, and database
limitations^[184]5,[185]6. DeepMSProfiler starts with raw untargeted
metabolomic data and retains essential information, enabling more
effective global analysis. It offers an alternative approach by
directly processing raw data of metabolomic samples, bypassing
time-consuming experiments such as quality control or reference sample
preparation and subsequent normalisation analysis.
In metabolomic study, systematic variations in the measured metabolite
profiles may occur during sample collection, processing, analysis, or
even in different batches of reagents or instruments. Batch effects can
significantly impact the interpretation of the results, leading to
inconsistencies in replicating findings across different
studies^[186]15. While batch effects can manifest as variations in
retention time (RT) offset, peak area, and peak shape, conventional
quantitative methods often prioritise peak area integration while
overlooking peak shape^[187]49,[188]50. Significantly, our results
demonstrate that DeepMSProfiler is able to automatically eliminate
cross-hospital variations during the end-to-end forward propagation
process (Fig. [189]3d), effectively revealing classification profiles.
Moreover, DeepMSProfiler can address the challenges of unidentified
metabolites. LC-MS metabolomics can reveal tens of thousands of
metabolite peaks in a biological sample. A substantial number of these
peaks remains unidentified or unannotated in existing databases. In
this study, we demonstrated that among all detected peaks, only 16.5%
are identified by HMDB and KEGG, and this large
proportion of unknown metabolites has a considerable influence on the
accuracy of classification (Fig. [190]4b). Indeed, annotating
metabolomic peaks has remained a major study focus in the
field^[191]16. A common approach involves comparing the exact mass of
detected peaks with authenticated standards, along with either the
retention time or the fragmentation spectra obtained through tandem
mass spectrometry (MS2). Despite significant development of molecular
structural databases and MS2 spectral databases, their current
capabilities and coverage remain limited^[192]51. In addition, network
analysis, which examines complex peak relationships and clusters, has
also been developed to facilitate the comprehensive identification of
metabolites^[193]52. In this study, we employed the deep learning
method to capture original signals in LC-MS metabolomic analysis
without compromising data integrity. We further implemented a direct
transition from m/z to pathway annotations by taking advantage of the
network-based analysis tool PIUMet^[194]39, effectively identifying 82
proteins and 121 metabolites in the cancer group, compared with 9
metabolites annotated by MS2.
Furthermore, our method is able to cover the metabolites identified by
conventional annotation and simultaneously uncover previously undetected
disease-specific features. In traditional metabolomic analysis,
biomarkers specific to the disease of interest are usually sought by
comparison of metabolite levels between control and case samples.
Therefore, peak alignment and metabolite annotation are crucial to the
end results. Here, by employing the end-to-end strategy, we unveiled
the complete biological outputs that contribute to the distinct
metabolomic profiles of each group. For example, tryptophan metabolism
was identified among the characteristics of lung adenocarcinoma profile
(Fig. [195]5l). The result was consistent with our previous discovery
by the conventional annotation method that metabolites in the
tryptophan pathway were decreased in the early-stage lung
adenocarcinoma compared with benign nodules and healthy
controls^[196]40. Serine and glycine are also important for nucleotide
synthesis by mediating one-carbon metabolism, which is relevant to
therapeutic strategy targeting non-small cell lung
cancer^[197]53–[198]56. Intriguingly, we also observed the contribution
of bile secretion in the lung adenocarcinoma profile (Fig. [199]5l),
which aligns with another report of aberrant bile acid metabolism in
invasive lung adenocarcinoma^[200]57. However, it should be noted that
the resolution of our model may be too limited to distinguish all
individual peaks contributing to the disease classification.
We additionally demonstrated that among deep learning models, ensemble
models are more stable and class-balanced than single models. Although
we have not fully comprehended the reason for the occurrence of
“background category”, the ensemble strategy has effectively mitigated
this phenomenon (Fig. [201]5a, b). Our investigation on the PACS image
dataset suggests that “background category” may generally exist in
multi-classification tasks using single models. Understanding its
underlying mechanism requires further investigation with a broader
range of datasets.
The high-resolution heatmaps generated by DeepMSProfiler display the
feature contributions to the predicted classes and the precise location
of specific metabolomic signals (Fig. [202]5c), providing explainable
analysis to assure the researchers of the biological soundness of the
prediction. With the capability of batch effect removal, comprehensive
metabolomic profiling, and ensemble strategy, DeepMSProfiler
demonstrates consistent and robust performance across different
categories. It achieves AUCs over 0.99 for the predictions of lung
adenocarcinoma, benign nodules, and healthy samples, and an accuracy of
96.1% for early-stage (stage-I) lung adenocarcinoma. Moreover, its
extended pan-cancer analysis illustrates its ability to uncover
potential disease-associated metabolites and proteins beyond lung
cancer. In conclusion, our DeepMSProfiler offers a straightforward and
reliable method suitable for applications in disease diagnosis and
mechanism discovery, potentially advancing the use of metabolomics in
precision medicine. Its effective end-to-end strategy applied to raw
metabolomic data can benefit a broader population in non-invasive
clinical practices for disease screening and diagnosis.
Methods
Clinical sample collection
This study was approved by the Ethics Committees of the Sun Yat-Sen
University Cancer Centre, the First Affiliated Hospital of Sun Yat-Sen
University and the Affiliated Cancer Hospital of Zhengzhou University.
A total of 210 healthy individuals, 323 patients with benign nodules
and 326 patients with lung adenocarcinoma were enrolled. Cases of lung
adenocarcinoma were collected prior to lung resection surgery and had
pathological confirmation. Serum from benign nodules was collected from
individuals undergoing annual physical examinations. Participants with
benign nodules were defined as those with stable computed tomography (CT)
scans over 3–5 years of follow-up at the time of analysis. The sample
collection period was from January 2018 to October 2022. The sex of the
participants was determined by self-report. Informed consent was
obtained from all participants. The study design and conduct complied
with all relevant regulations regarding the use of human study
participants, and the study was conducted in accordance with the criteria
set by the Declaration of Helsinki.
In addition, we collected serum samples from 100 healthy blood donors,
including 50 males and 50 females, aged between 40 and 55 years, from
the Department of Cancer Prevention and Medical Examination, Sun
Yat-Sen University Cancer Centre. All these samples were mixed in equal
amounts and the resultant mixture was aliquoted and stored. These
mixtures were used as reference samples for quality control and data
normalisation in the conventional metabolomic analysis as previously
described^[203]34.
Serum metabolite extraction
Fasting blood samples were collected in serum separation tubes without
the addition of anticoagulants, allowed to clot for 1 h at room
temperature, and then centrifuged at 2851 × g for 10 min at 4 °C to
collect the serum supernatant. The serum was aliquoted and then frozen
at −80 °C until metabolite extraction.
Reference serum and study samples were thawed and a combined extraction
method (methyl tert-butyl ether/methanol/water) was used to extract
metabolites. Briefly, 50 μL of serum was mixed with 225 μL of ice-cold
methanol and 750 μL of ice-cold methyl tert-butyl ether (MTBE). The
mixture was vortexed and incubated for 1 h on ice. Then 188 μL MS grade
water containing internal standards (^13C-lactate, ^13C[3]-pyruvate,
^13C-methionine and ^13C[6]-isoleucine, all from Cambridge Isotope
Laboratories) was added and vortexed. The mixture was centrifuged at
15,000 × g for 10 min at 4 °C, and then the bottom phase was
transferred to two tubes (each containing 125 μL) for LC-MS analysis in
positive and negative modes. Finally, the samples were dried in a
high-speed vacuum concentrator.
Untargeted liquid chromatography-mass spectrometry
The dried metabolites were resuspended in 120 μL of 80% acetonitrile,
vortexed for 5 min and centrifuged at 15,000 × g for 10 min at 4 °C.
The supernatant was transferred to a glass amber vial with a micro
insert for metabolomic analysis. Untargeted metabolomic analysis was
performed on an ultra-performance liquid chromatography-high resolution
mass spectrometry (UPLC-HRMS) platform. The metabolites were separated
using the Dionex Ultimate 3000 UPLC system with an ACQUITY BEH Amide
column (2.1 × 100 mm, 1.7 μm, Waters). In positive mode, the mobile
phase comprised 95% acetonitrile (A) and 50% acetonitrile (B), both containing
10 mmol/L ammonium acetate and 0.1% formic acid. In negative mode, the mobile
phase was composed of 95% and 50% acetonitrile for phases A and B,
respectively, both containing 10 mmol/L ammonium acetate and adjusted
to pH 9. Gradient elution was performed as follows: 0–0.5 min, 2% B;
0.5–12 min, 2–50% B; 12–14 min, 50–98% B; 14–16 min, 98% B;
16–16.1 min, 98–2% B; 16.1–20 min, 2% B. The column temperature was
maintained at 40 °C, and the autosampler was set at 10 °C. The flow
rate was 0.3 mL/min, and the injection volume was 3 μL. A Q-Exactive
orbitrap mass spectrometer (Thermo Fisher Scientific) with an
electrospray ionisation (ESI) source was operated in full scan mode
coupled with ddMS2 monitoring mode for mass data acquisition. The
following mass spectrometer settings were used: spray voltage
+3.8 kV/−3.2 kV; capillary temperature 320 °C; sheath gas 40 arb;
auxiliary gas 10 arb; probe heater temperature 350 °C; scan range
70–1050 m/z; resolution 70000. Xcalibur 4.1 (Thermo Fisher Scientific)
was used for data acquisition.
In this study, all serum samples were analysed by LC-MS in 10 batches.
To assess data quality, a mixed quality control (QC) sample was
generated by pooling 10 μL of supernatant from each sample in the
batch. Six QC samples were injected at the beginning of the analytical
sequence to assess the stability of the UPLC-MS system, with additional
QC samples injected periodically throughout the batch. Serum pooled
from 100 healthy donors was used as reference material in each batch to
monitor extraction and batch effects. All untargeted metabolomic
analysis was performed at the Sun Yat-Sen University Metabolomics
Centre.
Public dataset collection
The raw dataset for pan-cancer lipid metabolomics data of CCLE was
downloaded from the Metabolomics Workbench database with accession
ST001142
([204]https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=
Study&StudyID=ST001142). There are 946 samples in total, including 23
cancer types. The quantitative lipid metabolite matrix and the DNA
methylation matrix were downloaded from the appendix of the
article^[205]2.
The LC-MS dataset of colon cancer was downloaded from the MetaboLights
database
([206]https://www.ebi.ac.uk/metabolights/editor/MTBLS1129/descriptors)
with 236 samples in total, including 197 colon cancer cases and 39
healthy controls. Due to the differences in disease samples,
classification purposes, instruments, and LC-MS parameters between
the public dataset and the private lung adenocarcinoma dataset, the
DeepMSProfiler model was re-trained on the public dataset.
Data format conversion
The raw format files of LC-MS data were converted to mzML format using
the MSConvert software. The data used to train the end-to-end model were
sampled directly from the mzML format without any further processing, so
the raw data could be used directly as input to the model. In each mzML
file, the ion intensity and mass-to-charge ratio of each ion point at each
time interval were recorded. Ion points were mapped into a 3D space by
their RT and m/z. A 2D matrix was then sampled from this 3D point array
using a max-pooling convolution kernel, with RT: 0.5 min and m/z: 50 as
the sampling starting point and RT: 0.016 min and m/z: 1 as the sampling
interval, which together set the sampling ranges of retention time and
mass-to-charge ratio. Using the sampling interval as a sliding
window, the maximum ion intensity in each interval was sampled to obtain
a two-dimensional matrix of 1024 × 1024 ion intensities.
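A hedged sketch of this rasterisation using the pyteomics mzML reader; the dictionary keys follow the standard mzML controlled vocabulary, retention times are assumed to be reported in minutes, and the grid parameters mirror the values stated above.

```python
import numpy as np
from pyteomics import mzml

def mzml_to_matrix(path, size=1024, rt_start=0.5, rt_step=0.016, mz_start=50.0, mz_step=1.0):
    """Max-pool ion intensities from an mzML file into a size x size (RT x m/z) matrix."""
    matrix = np.zeros((size, size), dtype=np.float32)
    with mzml.read(path) as reader:
        for spectrum in reader:
            rt = spectrum['scanList']['scan'][0]['scan start time']  # assumed to be in minutes
            rt_bin = int((rt - rt_start) / rt_step)
            if not 0 <= rt_bin < size:
                continue
            mz_bins = ((spectrum['m/z array'] - mz_start) / mz_step).astype(int)
            keep = (mz_bins >= 0) & (mz_bins < size)
            # Keep the maximum intensity falling into each (RT, m/z) cell.
            np.maximum.at(matrix[rt_bin], mz_bins[keep], spectrum['intensity array'][keep])
    return matrix
```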
Extraction and annotation of metabolic peaks
We used Compound Discoverer v3.1 and TraceFinder v4.0 (Thermo Fisher
Scientific) for peak alignment and extraction. These steps resulted in
a matrix containing retention time, mass-to-charge ratio and peak area
information for each metabolite. To eliminate potential batch-to-batch
variation, we used the Ref-M method to correct peak areas. This
involved dividing the peak area of each feature in the study sample by
the peak area of the reference compound from the same batch, yielding
relative abundances. We then used several data annotation tools such as
MetID, MetDNA and NetID in MZmine to annotate the metabolite features
and combined the annotation results^[207]52,[208]58–[209]60. These
analysis tools include mass spectral information from databases such as
the HMDB, MassBank and MassBank of North America
(MoNA)^[210]41,[211]61. In addition, we performed data annotation using
MS2 data based on Compound Discoverer v3.1 and finally selected
bio-compounds with MS2 matches to certified standards or compounds with
inferred annotations based on complete matches in mzCloud (score > 85)
or ChemSpider as the precise annotation results^[212]62.
Raw data visualisation
The mzML files were read using the Pyteomics package in Python, and the
records were traversed for all time points in the sampling interval^[213]63.
For each time index in the mzML files, the preset scan
configuration, scan window, ion injection time, intensity
array, and m/z array were recorded. The intensity array and m/z array were
selected to form an array of data points, with retention time,
mass-to-charge ratio, and intensity as the row names. The intensity
values were log2 processed. Then, the 3D point cloud data was
visualised using the Matplotlib Toolkits package in Python^[214]64. The
2D matrixes were obtained by down-sampling the 3D point cloud and
pooling the 3D data using median and maximum convolution kernels.
Convolution spans were RT: 0.1 min and m/z: 0.001. Heatmaps and
contours were plotted using Matplotlib. Retained time-intensity curves
were also plotted using Matplotlib with an m/z span of 0.003.
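A small illustrative sketch of the 3D point-cloud view follows; the rt, mz, and intensity arrays are assumed to have been collected from the mzML records as described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the 3D projection

def plot_point_cloud(rt, mz, intensity):
    """Scatter the ion points in (RT, m/z, log2 intensity) space."""
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(rt, mz, np.log2(np.asarray(intensity) + 1), s=1)
    ax.set_xlabel('RT (min)')
    ax.set_ylabel('m/z')
    ax.set_zlabel('log2 intensity')
    plt.show()
```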
Dataset division and assignment
The dataset of each batch was randomly divided into a training dataset
and an independent testing dataset in a ratio of 4:1. The data from the
first to the seventh batch contained 90 samples each, including 30
healthy individuals, 30 lung nodules, and 30 lung adenocarcinoma
samples. The data for the eighth and ninth batches did not contain
healthy samples. The data for the tenth batch only contained nodule
samples. To avoid the effect of class imbalance, we constrained the
training and independent testing datasets to have the same sample-type
and sex ratios. Because the samples came from patients of different
ages and sexes, and the lesion sizes of lung-nodule and
lung-adenocarcinoma patients also varied, sex, age, and lesion size
were used as additional constraints for dataset division to prevent
these attributes from compromising the validity of the model. The
difference in the distribution of sample attributes between the
training dataset and the independent testing dataset was verified by an
unpaired t-test.
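A minimal sketch of such a constrained 4:1 split is given below, assuming a per-batch sample table with 'label', 'sex', and 'age' columns (illustrative names); continuous attributes such as age and lesion size would need binning before they can be used as strata.

```python
from scipy.stats import ttest_ind
from sklearn.model_selection import train_test_split

def split_batch(samples_df, seed=0):
    """4:1 split stratified jointly on class label and sex."""
    strata = samples_df['label'].astype(str) + '_' + samples_df['sex'].astype(str)
    train_df, test_df = train_test_split(
        samples_df, test_size=0.2, stratify=strata, random_state=seed)
    # Verify that a continuous attribute (e.g. age) is balanced (unpaired t-test).
    _, p_age = ttest_ind(train_df['age'], test_df['age'])
    return train_df, test_df, p_age
```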
Deep learning model construction in detail
In this step, we aimed to construct a model to predict the class labels
for each metabolomic sample. For this, we first set X and Y as the
input and label spaces, respectively. A single end-to-end model
consisted of three parts, a dimension converter based on pool
processing, a feature detector based on the convolutional neural
networks, and a multi-layer perceptron (MLP) classifier. The input
taken directly from the raw data was extremely large and contained
substantial redundant information, so a pooling step was required to
reduce the resolution for the downstream convolution operations. The
model input was reduced by a maximum pooling layer to obtain D(X),
which then entered the feature extractor dominated by convolutional
layers to obtain F(D(X)). Convolutional neural networks have local
translation invariance and are therefore well suited to the RT offset
problem in metabolomic data. Owing to the relatively large size of the
data, more than five layers of convolutional operations were required
to reduce the dimensionality to a level that the available computing
resources could handle.
Different architectures were compared on the tuning set. The
architectures included VGGNet (VGG16, VGG19), Inception (InceptionV3),
ResNet (ResNet50), DenseNet (DenseNet121), and EfficientNet
(EfficientNetB0-B5)^[215]33,[216]65–[217]67. In addition, two optimised
variants of DenseNet121 were created to simplify the DenseNet network:
the direct-connection variant replaced the last dense layer of
DenseNet121 with a convolutional layer, and the optimised variant
replaced the last dense layer with a convolutional layer that retained
a one-hop connection.
pre-trained models were derived from ImageNet. Each architecture was
tested on the TensorFlow + Keras platform and PyTorch platform,
respectively. To reduce overfitting, we used only one linear layer for
our MLP layer. In the TensorFlow + Keras model, there was a SoftMax
activation layer before the output layer. The output of the model was
C(F(D(X))).
The positive- and negative-mode spectral data were passed through
separate convolutional layers for feature extraction, and their
features were combined before the fully connected layer; the
pre-training parameters of the two branches were shared. For a model
trained on both positive and negative spectral data, a cross-entropy
loss was used.
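The following Keras sketch illustrates this overall design (pooling-based dimension converter D, convolutional feature extractor F, single-layer MLP classifier C, and separate positive/negative branches merged before the classifier). The pool size, backbone choice, and channel handling are assumptions for illustration; this is not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_branch(name):
    inp = layers.Input(shape=(1024, 1024, 1), name=f'{name}_input')
    x = layers.MaxPooling2D(pool_size=4, name=f'{name}_pool')(inp)       # D(X)
    x = layers.Concatenate(name=f'{name}_to_rgb')([x, x, x])             # 1 -> 3 channels for ImageNet weights
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights='imagenet', input_shape=(256, 256, 3))
    backbone._name = f'{name}_densenet121'  # rename to avoid duplicate sub-model names
    x = backbone(x)                                                      # F(D(X))
    x = layers.GlobalAveragePooling2D(name=f'{name}_gap')(x)
    return inp, x

pos_in, pos_feat = build_branch('pos')   # positive-mode spectra
neg_in, neg_feat = build_branch('neg')   # negative-mode spectra
merged = layers.Concatenate(name='merge_modes')([pos_feat, neg_feat])
out = layers.Dense(3, activation='softmax', name='classifier')(merged)   # C(F(D(X)))

model = Model([pos_in, neg_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss='categorical_crossentropy', metrics=['accuracy'])
```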
Model training
20% of the discovery dataset was set aside as a tuning set, used to
guide model architecture selection, hyperparameter selection, and model
optimisation; the remaining 80% was used for model training. Sample
category balance was used as a constraint for this split. Model
architectures were evaluated by both validation performance and
operational performance: we counted the number of model parameters to
evaluate model complexity, and the average of 10 runs was used as the
runtime. Hyperas was used to preliminarily select the optimal model
hyperparameters and initialisation weights^[218]68. Although he_normal
was the optimal initialisation method, we opted for pre-training on the
ImageNet dataset because of its comparable performance and faster
execution. After narrowing the parameter search space, we used grid
search for hyperparameter tuning.
Ensemble strategy
DeepMSProfiler is an ensemble of several trained end-to-end sub-models;
the average of the classification probabilities of a sample across all
sub-models is used as its final prediction probability. The ensemble
model produces a score vector of length three over the three classes,
and the category with the maximum score is selected as the predicted
classification result.
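In code, this ensemble rule reduces to averaging the sub-model probabilities and taking the arg-max, for example (a sketch; `models` is a list of trained sub-models):

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the class-probability outputs of all sub-models and pick the arg-max class."""
    probs = np.mean([m.predict(x) for m in models], axis=0)  # shape: (n_samples, 3)
    return probs, probs.argmax(axis=1)
```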
Each end-to-end sub-model was trained on the discovery dataset. The
architecture of each sub-model was the same, but some hyperparameters
differed: two learning rates, 1e-3 and 1e-4, were used. The optimiser
was Adam with beta_1 = 0.9, beta_2 = 0.999, epsilon = 0.001, decay =
0.0, and amsgrad = False. The batch size was set to 8 and training was
run for 200 epochs. A model typically took about 2 h to train on a
GP100GL (16 GB) GPU server. Each sub-model contributed equally to the
final prediction, without specific weights. The independent testing
dataset was not used in model training or hyperparameter selection.
Machine learning models for comparison
To compare our DeepMSProfiler to other existing tools, we selected
several common traditional machine learning methods to build
tri-classification models based on the peak area data obtained from the
previous steps. These methods included Extreme Gradient Boosting
(XGBoost), RF, Adaptive Boosting (Adaboost), SVM, and DNN. The training
dataset and independent testing dataset were divided in the same way as
the deep learning algorithm, and the numbers of estimators for Adaboost
and XGBoost algorithms were the same as those of DeepMSProfiler.
XGBoost was implemented by the XGBClassifier function in the xgboost
library. Other machine learning methods were implemented using the
SciKitLearn library. SVM was adopted using the svm function with a
‘linear’ kernel, RF was implemented through the RandomForestClassifier
function, Adaboost through the AdaBoostClassifier function, and DNN
using the MLPClassifier function. The optimal hyperparameters were
obtained by the grid search method.
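A hedged sketch of these baseline classifiers is shown below; apart from the linear SVM kernel and the 20 estimators shared with the ensemble, the hyperparameters are placeholders rather than the tuned values.

```python
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

baselines = {
    'XGBoost': XGBClassifier(n_estimators=20),
    'SVM': SVC(kernel='linear', probability=True),
    'RF': RandomForestClassifier(),
    'Adaboost': AdaBoostClassifier(n_estimators=20),
    'DNN': MLPClassifier(),
}

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit each baseline on peak-area features and report test accuracy."""
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in baselines.items()}
```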
Performance metrics
We evaluated the performance of the model on the independent testing
dataset. The evaluation metrics included accuracy, precision,
sensitivity, and F1 score; micro-averaging was used for the multi-class
model. Confidence intervals were estimated using 1000 bootstrap
iterations, during which our model was an ensemble of 20 models trained
on the discovery dataset. In addition, we calculated a confusion matrix
and ROC curves with their AUCs to demonstrate the performance of the
model on the three classes of lung adenocarcinoma, benign nodules, and
healthy individuals. At sensitivities of 0.7 and 0.9, the specificity
was read from the sensitivity-specificity curve.
The sensitivity-specificity curve was interpolated using the
nearest-neighbour method.
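The bootstrap estimate can be sketched as follows (accuracy is used as an example metric; the resampling unit is the test-set sample):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a classification metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        stats.append(accuracy_score(y_true[idx], y_pred[idx]))
    low, high = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high
```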
Visualisation of “black-box”
In the end-to-end neural network prediction, the data flowed in a chain
of
$X \rightarrow D(X) \rightarrow F(D(X)) \rightarrow C(F(D(X)))$
from the input layer through the hidden layer to the output layer. In
the feature extraction layer, which is dominated by convolutional
layers, information was passed in the same chain manner. After
inputting X, we obtained the corresponding output L of different hidden
layers to open the black-box process. To observe the space of
intermediate features, PCA was used to reduce the dimensionality of L
to its principal components. The PCA result was visualised with the
Matplotlib package in Python.
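A hedged Keras sketch of this step is given below; the layer name is an illustrative assumption.

```python
import tensorflow as tf
from sklearn.decomposition import PCA

def hidden_layer_pca(model, layer_name, x, n_components=2):
    """Extract the output L of a named hidden layer and project it onto principal components."""
    extractor = tf.keras.Model(model.inputs, model.get_layer(layer_name).output)
    L = extractor.predict(x)
    L = L.reshape(len(L), -1)        # flatten any spatial dimensions
    return PCA(n_components=n_components).fit_transform(L)
```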
To evaluate the correlation of hidden layer output with batch label and
type label, respectively, we calculated NMI, ARI, and ASC using the
following formulas. L was the layer output and C was the cluster labels
used for the cluster evaluation.
$$\mathrm{MI}(L,C)=\sum_{i}^{\max L}\sum_{j}^{\max C}P(i,j)\,\log\frac{P(i,j)}{P(i)\,P(j)}\tag{1}$$

$$\mathrm{NMI}(L,C)=\frac{2\,\mathrm{MI}(L,C)}{H(L)+H(C)}\tag{2}$$
In the above equations, the mutual information (MI) is computed between
the layer output L and the label clusters C. $P(i,j)$ denotes the joint
distribution probability of i and j, and $P(i)$ and $P(j)$ denote the
marginal distribution probabilities of i and j, respectively. $H(L)$
and $H(C)$ represent the entropy of L and C, respectively. The clusters
of the output layer were obtained using the K-nearest neighbour
algorithm.
$$\mathrm{ARI}(L,C)=\frac{2\,(TP\times TN-FN\times FP)}{(TP+FN)(FN+TN)+(TP+FP)(FP+TN)}\tag{3}$$
In the above equation, TP represents the number of point pairs
belonging to the same cluster in both real and experimental cases, and
FN represents the number of point pairs belonging to the same cluster
in the real case but not in the same cluster in the experimental case.
FP represents the number of point pairs not belonging to the same
cluster in the real case but in the same cluster in the experimental
case, and TN represents the number of point pairs not belonging to the
same cluster in both the real and the experimental cases. The range of
ARI is [−1, 1]; the larger the value, the more consistent the
clustering is with the real result, and hence the better the
clustering.
$$\mathrm{ASC}(L,C)=\frac{1}{N}\sum_{i}^{N}\frac{b_i-a_i}{\max(a_i,b_i)}\tag{4}$$
The output layer was first dimensionally reduced by PCA, and the
clusters were specified by the real labels. In the above equation,
$a_i$ represents the average dissimilarity of vector i to the other
points in the same cluster, and $b_i$ represents the minimum
dissimilarity of vector i to points in other clusters.
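All three scores have standard implementations in scikit-learn; the sketch below assumes `embedding` is the PCA-reduced layer output, `labels` the real (batch or type) labels, and `clusters` the predicted cluster assignments.

```python
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score, silhouette_score)

def cluster_scores(embedding, labels, clusters):
    """NMI and ARI compare cluster assignments to labels; ASC scores label separation in the embedding."""
    return {
        'NMI': normalized_mutual_info_score(labels, clusters),
        'ARI': adjusted_rand_score(labels, clusters),
        'ASC': silhouette_score(embedding, labels),
    }
```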
Feature contributions
In previous studies of feature contributions, different lines of work
have used different methods to compute the contribution of features to
the final classification. These methods help to better understand
features and their impact on model predictions. Gradient-based methods,
such as GradCAM, calculate the gradients of the last convolutional
layer by backpropagation from the category with the highest
confidence^[219]69. Because of its convenience, this method is widely
used in computer-vision tasks, but it has a significant limitation: the
resolution of the feature-contribution heatmap is extremely low and
cannot distinguish most signal peaks. The size of the
feature-contribution heatmap corresponds to that of the last
convolutional layer of the model, and the weight of each feature
contribution is the average of the gradients of all features. On the
other hand,
perturbation-based methods, such as RISE and Local Interpretable
Model-Agnostic Explanations, measure the importance of features by
obscuring some pixels in raw data^[220]37,[221]70. The predictive
ability of the black box is then observed to show how much this affects
the prediction. Perturbation-based methods yield higher resolution and
more accurate contribution estimates, but their runtimes are longer. To
improve computing speed in this study, we made improvements based on
RISE, using boost sampling for the mask process. With this RISE-based
approach, we determined the feature contributions of RT and m/z for
each sample according to its true category. The feature-contribution
heatmap uses RT as the horizontal axis and m/z as the vertical axis to
show the contribution at different positions for each sample. The
average feature contribution of all samples correctly predicted as
their true category was taken as the feature contribution of that
category. Meanwhile, from the peak extraction performed in the previous
steps, we determined the RT range and the median m/z of each signal
peak; the feature contribution at the corresponding RT and median m/z
coordinates was then taken as the contribution of that signal peak.
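A compact sketch of this perturbation-based scoring is shown below for a single-input model; the mask count, grid size, and keep probability are placeholders, and the original RISE additionally smooths and randomly shifts the upsampled masks.

```python
import numpy as np

def rise_saliency(model, x, true_class, n_masks=500, grid=16, p_keep=0.5, seed=0):
    """x: one input of shape (1024, 1024, 1); returns a (1024, 1024) contribution map."""
    rng = np.random.default_rng(seed)
    h, w = x.shape[:2]
    scale = (h // grid, w // grid)
    saliency = np.zeros((h, w), dtype=np.float32)
    for _ in range(n_masks):
        coarse = (rng.random((grid, grid)) < p_keep).astype(np.float32)
        mask = np.kron(coarse, np.ones(scale, dtype=np.float32))  # nearest-neighbour upsampling
        score = model.predict((x * mask[..., None])[None], verbose=0)[0, true_class]
        saliency += score * mask   # weight each mask by the true-class confidence
    return saliency / n_masks
```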
Network analysis and pathway enrichment
The extracted metabolic peaks with a contribution score greater than
0.70 to the lung cancer classification were filtered. Mass-to-charge
ratio and some substance identification information of these
metabolites and their classification contribution scores were used as
input data. For some of the metabolic signal peaks, we accurately
identified their molecular formulae and substance names by secondary
mass spectrometry and used these as substance identification
information. Owing to the limitations of existing databases, many
unknown signals could not be identified by secondary mass spectrometry;
therefore, PIUMet was also adopted to search for hidden metabolites and
related proteins.
PIUMet built disease-related metabolite-protein networks based on the
prize-collecting Steiner Forest algorithm. First, PIUMet integrated
iRefIndex (v13), HMDB (v3) and Recon2 databases to obtain the
relationship between m/z, metabolites and proteins, and generated an
overall metabolite-protein network. The edges were weighted by the
confidence level of the correlation reported in these databases.
iRefIndex provides details on the physical interactions between
proteins, which are detected through different experimental methods.
The protein-metabolite relationships in Recon2 are based on their
involvement in the same reactions. HMDB includes proteins that are
involved in metabolic pathways or enzymatic reactions, as well as
metabolites that play a role in protein pathways, based on known
biochemical and molecular biological data. The disease-related
metabolite peaks obtained by DeepMSProfiler were matched to the
metabolite nodes of the overall network by their m/z, and directly to
the terminal metabolite nodes of the overall network after annotation.
The corresponding feature contributions obtained by DeepMSProfiler
served as prizes for these metabolite nodes. The network structure was
then optimised using the prize-collecting Steiner Forest algorithm to
minimise the total network cost and connect all terminal nodes, thereby
removing low-confidence relationships and obtaining disease-related
metabolite sub-networks.
Metabolite identification is an important issue in metabolomics
research and there are different levels of confidence in
identification. Referring to the highest confidence level
considered^[222]71, we analysed authentic chemical standards and
validated 11 of the metabolites that PIUMet had discovered from m/z
alone (Supplementary Table [223]5). Then, disease-related metabolites
and proteins were used
to analyse their pathways^[224]39. These hidden metabolites and
proteins from PIUMet were then processed for KEGG pathway enrichment
analysis using MetaboAnalyst (v6.0). We used the joint pathway analysis
in MetaboAnalyst, choosing the hypergeometric test for enrichment
analysis and degree centrality as the topology measure. The integrated
version of the KEGG pathways (year 2019) was adopted in MetaboAnalyst.
Pathways were filtered using a p-value cut-off of 1e-5^[225]72. The
corresponding
SYMBOL IDs of the proteins were converted to KEGG IDs by the
ClusterProfiler package in R^[226]73.
Ablation experiment
We searched the PubMed database for a total of 5088 articles using the
terms “serum”, “lung cancer” and “metabolism” from 2010 to 2022. By
reading their titles and abstracts, we excluded publications that used
non-serum samples such as urine and tissue, as well as publications
that used non-mass-spectrometry methods such as chromatography, nuclear
magnetic resonance, and infrared spectroscopy. We then further screened
the selected literature to exclude studies that did not report the
discovery of metabolic biomarkers. Finally, 49 publications remained,
reporting 811 serum metabolic biomarkers for lung cancer. Some of these
publications provided the retention time and mass-to-charge ratio of
the biomarkers, whereas others gave only the name of the identified
biomarker; we therefore looked up the molecular weights of these
metabolites in the HMDB database to match the corresponding m/z.
Matching m/z through metabolite molecular weights took full account of
the effect of adducts. Based on the number of publications reporting
each biomarker, we retained the m/z values corresponding to biomarkers
reported in more publications than a given threshold. We filtered the
signals in the raw data to exclude signals that did not fall into the 3
ppm intervals around these m/z. The filtered raw data were used as
input to the model.
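The 3 ppm filter can be sketched as follows; `biomarker_mzs` stands for the set of literature-derived m/z values described above.

```python
import numpy as np

def ppm_filter(mz_array, intensity_array, biomarker_mzs, ppm=3.0):
    """Zero out all signals whose m/z is not within +/- ppm of any biomarker m/z."""
    mz_array = np.asarray(mz_array)
    keep = np.zeros(len(mz_array), dtype=bool)
    for target in biomarker_mzs:
        tolerance = target * ppm * 1e-6
        keep |= np.abs(mz_array - target) <= tolerance
    filtered = np.asarray(intensity_array, dtype=float).copy()
    filtered[~keep] = 0.0
    return filtered
```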
Statistical analysis
All statistical calculations were performed using the stats package in
Python, and data distributions were plotted using the Seaborn package
in Python. The correlation between patient information and labels was
calculated using Pearson’s, Spearman’s, and Kendall’s correlation
coefficients: Pearson’s correlation coefficient was preferred for
detecting linear correlations, while Spearman’s and Kendall’s rank
correlation coefficients were used to detect non-linear correlations.
P-values below 0.05 were considered significant.
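For reference, the three correlation coefficients can be computed with SciPy as follows (a sketch; x and y stand for a patient attribute and the label encoding):

```python
from scipy import stats

def correlations(x, y):
    """Pearson (linear), Spearman and Kendall (rank-based) correlation coefficients."""
    pearson_r, _ = stats.pearsonr(x, y)
    spearman_r, _ = stats.spearmanr(x, y)
    kendall_t, _ = stats.kendalltau(x, y)
    return {'pearson': pearson_r, 'spearman': spearman_r, 'kendall': kendall_t}
```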
Figure preparation
The main figures in this paper were assembled in Adobe Illustrator. The
photograph of the mass spectrometry instruments was taken of the actual
equipment. The data-structure diagrams were generated by fitting
simulated functions in Python. Some cartoon components were drawn
through
FigDraw ([227]www.figdraw.com) with a license code (TAIRAff614) for
free use.
AI-assisted technologies in the writing process
At the end of the preparation of this work, the authors used ChatGPT to
proofread the manuscript. After using this tool, the authors reviewed
and edited the content as needed and take full responsibility for the
content of the publication.
Reporting summary
Further information on research design is available in the [228]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[229]Supplementary Information^ (4.4MB, pdf)
[230]Reporting Summary^ (299.2KB, pdf)
[231]Supplementary Data 1^ (29.6KB, xlsx)
[232]Supplementary Data 2^ (50.2KB, xlsx)
[233]Supplementary Data 3^ (35.1KB, xlsx)
Source data
[234]Source data^ (5MB, xlsx)
[235]Transparent Peer Review file^ (101.3KB, pdf)
Acknowledgements