Abstract
Untargeted metabolomic analysis using mass spectrometry provides
comprehensive metabolic profiling, but its medical application faces
challenges of complex data processing, high inter-batch variability,
and unidentified metabolites. Here, we present DeepMSProfiler, an
explainable deep-learning-based method that enables end-to-end analysis of
raw metabolic signals with high accuracy and reliability.
Using 859 cross-hospital human serum samples from patients with lung
adenocarcinoma or benign lung nodules and from healthy individuals, DeepMSProfiler
successfully differentiates the metabolomic profiles of different
groups (AUC 0.99) and detects early-stage lung adenocarcinoma (accuracy
0.961). Model flow and ablation experiments demonstrate that
DeepMSProfiler overcomes inter-hospital variability and effects of
unknown metabolite signals. Our ensemble strategy removes
background-category phenomena in multi-classification deep-learning
models, and the novel interpretability enables direct access to
disease-related metabolite-protein networks. Further application to lipid
metabolomic data unveils correlations between important metabolites and
proteins. Overall, DeepMSProfiler offers a straightforward and reliable
method for disease diagnosis and mechanism discovery, with broad
applicability.
Subject terms: Machine learning, Metabolomics, Cancer metabolism
__________________________________________________________________
Untargeted metabolomic analysis provides comprehensive metabolic
profiling but faces challenges in medical application. Here, the
authors present an explainable deep learning method for end-to-end
analysis on raw metabolic signals to differentiate metabolomic profiles
of cancers with high accuracy.
Introduction
Metabolomics offers a comprehensive view of small molecule
concentrations within a biological system and plays a pivotal role in
the discovery of disease biomarkers for diagnostic purposes^[50]1.
Liquid chromatography mass spectrometry (LC-MS) is a widely practiced
experimental technique in global metabolomic studies^[51]2,[52]3. High
sensitivity, stability, reproducibility, and detection throughput are
the unique advantages of untargeted LC-MS^[53]4. Despite its capacity
to measure thousands of ion peaks, the conventional metabolomic study
by LC-MS remains a challenging task due to laborious data processing
such as peak picking and alignment, metabolite annotation by comparing
to authenticated databases, and data normalisation to control unwanted
variability in a large-scale study^[54]5,[55]6. The broader application
of metabolomics in precision medicine may be impeded by obstacles such
as complex data processing, high inter-batch variability, and
burdensome metabolite identification^[56]7.
Untargeted metabolomics has been conducted on various human biological
fluids, including serum and plasma, for the discovery of biomarkers in
cancers such as hepatocellular carcinoma^[57]8, pancreatic^[58]9,
prostate^[59]10, and lung cancers^[60]11–[61]13. However, such
biomarker discovery studies utilising metabolomics face significant
challenges regarding reproducibility, likely due to signal drifts in
cross-batch or cross-platform analysis^[62]14 and the limited
integration of data from different laboratory samples^[63]15.
Furthermore, unknown metabolites are excluded when comparing detected
features to authenticated databases^[64]16, which may hinder our
ability to discover new biomarkers associated with diseases. Several
previous studies have combined machine learning with LC-MS for in vitro
disease diagnosis and improved the efficiency of LC-MS data analysis.
For example, Huang et al. conducted machine learning to extract serum
metabolic patterns from laser desorption/ionisation mass spectrometry
to detect early-stage lung adenocarcinoma^[65]11; Chen et al. adopted
machine learning models to conduct targeted metabolomic data analysis
to identify non-invasive biomarkers for gastric cancer diagnosis^[66]17;
Shen et al. developed a deep-learning-based Pseudo-Mass Spectrometry
Imaging method and applied it in the prediction of gestational age of
pregnant women, as well as the diagnosis of endometrial cancer and
colon cancer^[67]18. However, these studies still face challenges such
as batch effects and unknown metabolites in metabolomics^[68]7.
Consequently, a new analytical approach is urgently needed to overcome
the experimental bottlenecks and reveal disease-associated profiles
comprising both identified and unknown components derived from LC-MS
peaks.
Deep learning has been widely applied in various omics data analyses,
holding promise for addressing the complexities of metabolomic
data^[69]19. The encoding and modelling capabilities of deep learning
offer a potential solution to overcome the aforementioned bottlenecks
in handling intricate and high-dimensional data, mitigating bias in
machine learning algorithms^[70]20,[71]21. However, deep learning
necessitates high-quality data and a sufficient quantity of samples;
otherwise, it suffers from issues such as the curse of dimensionality and
overfitting of predictive models^[72]22. Moreover, integrating large
datasets collected from multiple hospitals may introduce significant
variations. Furthermore, as deep learning methods are usually perceived
as “black-box” processes, the importance of model interpretability for
prediction in the context of biomedical research is increasingly
recognised^[73]23–[74]25. Therefore, a deep learning model with both
interpretability for biological soundness and capability to mitigate
batch effects is highly desirable to enhance the reliability of
large-scale metabolomic analyses for diagnostic purposes.
In this study, we develop an ensemble end-to-end deep learning method
named deep learning-based mass spectrum profiler (DeepMSProfiler)
for untargeted metabolomic data analysis. We firstly apply this method
to differentiate healthy individuals and patients with benign lung
nodules or lung adenocarcinoma using 859 serum samples from three
distinct hospitals, followed by its extended analysis on lipid
metabolomic data derived from 928 cell lines to reveal metabolites and
proteins associated with multiple cancer types. Without the peak
extraction and identification steps, and without the potential errors
introduced by conventional machine learning approaches, our method directly converts
raw LC-MS data into outputs such as predicted classification, heatmaps
illustrating key metabolite signals specific to each class, and
metabolic networks that influence the predicted classes. Importantly,
DeepMSProfiler effectively removes undesirable batch effects and
variations across different hospitals and infers the unannotated
metabolites associated with specific classifications. Furthermore, it
leverages an ensemble-model strategy that optimises feature attribution
from multiple individual models. DeepMSProfiler achieved an area under
the receiver operating characteristic curve (AUC) score of 0.99 in an
independent testing dataset, along with an accuracy of 96.1% in
detecting early-stage lung adenocarcinoma. The results are explainable
by locating the relevant biological components that contribute to the
prediction. Our method provides a straightforward and reliable
approach for metabolomic applications in disease diagnosis and
mechanism discovery.
Results
The overview of the ensemble end-to-end deep-learning model
The DeepMSProfiler method includes three main components: the
serum-based mass spectrometry, the ensemble end-to-end model, and the
disease-related LC-MS profiles (Fig. [75]1a). In the first component,
the raw LC-MS-based metabolomic data was generated using 859 human
serum samples (Fig. [76]1a left) collected from 210 healthy
individuals, 323 patients with benign lung nodules, and 326 patients with
lung adenocarcinoma. The
space of the LC-MS raw data contains three dimensions: retention time
(RT), mass-to-charge ratio (m/z), and intensity. Using the RT and m/z
dimensions, the data can be mapped from three-dimensional space into
the time and frequency domains, respectively (Fig. [77]1b left to
middle). Ion current maps and primary mass spectra can then be
generated and used for metabolite identification (Fig. [78]1b middle to
right). Conventional step-by-step methods of metabolomic
analysis^[79]5,[80]22 (Supplementary Fig. [81]1 top) may lead to the
loss of a large number of metabolic signals. To address these issues,
DeepMSProfiler directly takes untargeted LC-MS raw data as model input,
and builds an end-to-end deep learning model to profile disease-related
metabolic signals (Supplementary Fig. [82]1 bottom).
Fig. 1. The DeepMSProfiler method using LC-MS-based untargeted serum
metabolome.
a The overview of DeepMSProfiler. Serum samples of different
populations (top left) were collected and sent to the instrument
(bottom left) for liquid chromatography-mass spectrometry (LC-MS)
analysis. The raw LC-MS data, containing information on retention time
(RT), mass-to-charge ratio (m/z), and intensity, is used as input to
the ensemble model (middle). Multiple single convolutional neural
networks form the ensemble model (centre) to predict the true label of
the input data and generate three outputs (right), including the
predicted sample classes, the contribution heatmaps of
classification-specific metabolic signals, and the
classification-specific metabolic networks. b The data structure of raw
data. The mass spectra of different colours (centre) represent the
corresponding m/z and ion intensity of ion signal groups recorded at
different RT frames. All sample points are distributed in a
three-dimensional space (left) which can be mapped along three axes to
obtain chromatograms, mass spectra, and two-dimensional matrix data.
Chromatograms and mass spectra are used for conventional qualitative
and quantitative analysis (right), while the two-dimensional matrix
serves as input data for convolutional neural networks. c The structure
of a single end-to-end model. The input data undergoes the pre-pooling
processing to reduce dimensionality and become three-channel data. As
the model passes through each convolutional layer (conv) in the feature
extractor module, the weights associated with the original signals
change continuously. The sizes of different frames in the enlarged
layers (top) represent different receptive fields, with DenseNet
allowing the model to generate more flexible receptive field sizes.
After the last fully connected layer (FC), the classifications are
resulted.
The main model adopts an ensemble strategy and consists of multiple
sub-models (Fig. [85]1a middle). The ensemble strategy is considered to
provide better generalisation^[86]26, owing to the complexity of the
hypothesis space^[87]27, the local optima reached during search^[88]27, the
parameter diversity introduced by the random training process^[89]28,[90]29,
and bias-variance theory^[91]30–[92]32.
The structure of each sub-model consists of three parts: a pre-pooling
module, a feature extraction module, and a classification module
(Fig. [93]1c). The pre-pooling module transforms three-dimensional data
into two-dimensional space through a max-pool layer, effectively
reducing dimensionality and redundancy while preserving global signals
(Fig. [94]1c left). The feature extraction module is based on a
convolutional neural network to perform classification tasks by
extracting category-related features (Fig. [95]1c middle). DenseNet121
is chosen as the backbone of the feature extraction module because it
achieves the highest accuracy with the fewest parameters (Supplementary
Data [96]1). Additionally, the design of densely connected
convolutional networks in DenseNet121^[97]33 makes the model more
flexible in terms of receptive field size, allowing adaptation to
different RT intervals of metabolic signal peaks. After the non-linear
transformation by the feature extraction layers, different weights are
assigned to the original input signal peaks for subsequent
classification. We evaluated the effect of the number of sub-models on
improving performance and consequently chose to use 18 sub-models for
DeepMSProfiler (Supplementary Fig. [98]2). The classification module
implements a simple dense neural network to compute the probabilities
of different classes (Fig. [99]1c right).
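For illustration, the following is a minimal sketch of one such sub-model in TensorFlow/Keras; the input size, pooling factor, channel replication, and training settings are assumptions for the sketch rather than the exact published configuration.

```python
# A hedged sketch of a single DeepMSProfiler-style sub-model (assumed hyperparameters).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sub_model(input_size=1024, n_classes=3):
    inputs = layers.Input(shape=(input_size, input_size, 1))
    # Pre-pooling module: max-pooling reduces resolution while preserving strong peaks.
    x = layers.MaxPooling2D(pool_size=4)(inputs)
    # Replicate the single intensity channel to three channels for the DenseNet backbone.
    x = layers.Concatenate()([x, x, x])
    # Feature extraction module: DenseNet121 backbone (ImageNet weights could also be used).
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights=None,
        input_shape=(input_size // 4, input_size // 4, 3))
    x = layers.GlobalAveragePooling2D()(backbone(x))
    # Classification module: a single dense layer with softmax over the classes.
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```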
In the third component, each input sample results in three outputs from
the model, including the predicted classification of the sample, the
heatmap for the locations of the key metabolic signals, and the
metabolic network that influences the predicted category (Fig. [100]1a
right). In the predicted classification, the category with the highest
probability is assigned as the predicted label of the model. In the
heatmap presentation, the key metabolic signals associated with
different classifications are inferred by the perturbation-based
method, and the m/z and RT of the key metabolic signals can be located.
Finally, the model infers the underlying metabolite-protein networks
and metabolic pathways associated with these key metabolic signals
directly from m/z (see Methods).
Model performance metrics
To test DeepMSProfiler’s ability in classifying disease states and
discovering disease-related profiles, we performed the global
metabolomic analysis of serum samples from healthy individuals and
patients with benign lung nodules or lung cancer. We collected serum
samples from three different hospitals to construct and validate the
model. Benign nodule cases were followed for up to 4 years and lung
adenocarcinoma samples were pathologically examined (see Methods). We
built the model using 859 untargeted LC-MS samples, of which 686 served as
the discovery dataset and 173 as the independent testing dataset
(Fig. [101]2a). The samples were generated from 10 batches, as shown in
Supplementary Table [102]1. Statistical analysis shows a significant
correlation between lesion size and disease type, but the correlations
with other clinical factors are not significant (Supplementary
Table [103]2). To avoid confounding effects of clinical features, we
performed a further analysis of the distributions, which shows no
significant difference in the distributions of lesion diameter and
patient age between the discovery dataset and the independent testing
dataset (Supplementary Fig. [104]3). The samples in the discovery
dataset were randomly divided into two subsets: a training dataset
(80%) for parameter optimisation and a validation dataset (20%) to
cross-validate the performance of different models. Remarkably, the
accuracies of our DeepMSProfiler in the training dataset, validation
dataset, and independent testing dataset are 1.0, 0.92, and 0.91,
respectively (Supplementary Table [105]3).
Fig. 2. The prediction performance of DeepMSProfiler.
a The sample allocation chart. The outer ring indicates the types of
diseases and the inner ring indicates the sex distribution. Healthy:
healthy individuals; Benign: benign lung nodules; Malignant: lung
adenocarcinoma. b Predicted receiver operating characteristic (ROC)
curves of different methods. Random: performance baseline in a random
state. Comparison of performance metrics of different methods (n = 50):
accuracy (c), precision (d), recall (e), and F1 score (f). The blue
areas show the different conventional analysis processes using machine
learning methods, and the red areas display different end-to-end
analysis processes using deep learning methods. The boxplot shows the
minimum, first quartile, median, third quartile, and maximum values, with
outliers shown as individual points. g Model accuracy rates for different age
groups. The sample sizes for different groups are 52, 69, 40, and 12,
respectively. h Model accuracy rates for different lesion diameter
groups. The sample sizes for different groups are 27, 37, 18, 13, and
34, respectively. The boxplot shows the minimum, first quartile,
median, third quartile and maximum values. i Prediction accuracy and
parameter scale of different model architectures. j The confusion
matrix of the DeepMSProfiler model. The numbers inside the boxes are
the number of matched samples between the true label and the predicted
label. The ratio in parentheses is the number of matched samples
divided by the number of all samples of the true label.
In the independent testing dataset, DeepMSProfiler significantly
outperforms traditional methods and single deep learning models
(Fig. [108]2b–h). Compared with Support Vector Machine (SVM), Random
Forest (RF), Deep Neural Network (DNN), Adaptive Boosting
(AdaBoost), and Extreme Gradient Boosting (XGBoost) based on
traditional methods, and Densely Connected Convolutional Networks
(DenseNet121) using raw data, DeepMSProfiler presents the highest area
under the curve (AUC) of 0.99 (Fig. [109]2b). Notably, DeepMSProfiler
exhibits higher specificity than other models while maintaining high
sensitivity, indicating its ability to accurately identify true
negatives (Supplementary Table [110]4).
Regarding overall performance against other models, our model achieves
the best performance in multiple evaluation metrics: accuracy of 95%
(95% CI, 94%–97%) (Fig. [111]2c), precision of 96% (95% CI, 94%–97%)
(Fig. [112]2d), recall of 95% (95% CI, 94%–96%) (Fig. [113]2e), and F1
of 98% (95% CI, 97%–98%) (Fig. [114]2f). Compared to XGBoost, our model
performs better in different groups of lesion sizes and ages
(Fig. [115]2g, h), except for samples from patients over 70 years old.
DeepMSProfiler is also superior to commonly used single deep learning
models, such as DenseNet121 (Fig. [116]2i). When using the ensemble
strategy, we did not set different weights for each sub-model, so each
sub-model is equally involved in the final prediction. The confusion
matrices of prediction performance for each sub-model (Supplementary
Fig. [117]4) show their contributions to the overall results. The
robust performance, coupled with the efficiency in terms of
computational resources, makes DeepMSProfiler a promising choice for
the classification tasks (Fig. [118]2i).
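Because the sub-models are equally weighted, the ensemble prediction reduces to averaging their class probabilities; a minimal sketch, assuming `sub_models` is a list of trained Keras sub-models and `x` a batch of preprocessed input matrices:

```python
import numpy as np

def ensemble_predict(sub_models, x):
    """Average softmax outputs of equally weighted sub-models; return probabilities and labels."""
    # Each sub-model yields an (n_samples, n_classes) probability matrix.
    probs = np.stack([m.predict(x, verbose=0) for m in sub_models], axis=0)
    mean_probs = probs.mean(axis=0)                # equal weights across sub-models
    return mean_probs, mean_probs.argmax(axis=1)   # class with the highest mean probability
```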
Furthermore, DeepMSProfiler demonstrates consistent performance across
different categories. All of the AUCs for lung adenocarcinoma, benign
lung nodules, and healthy individuals reach 0.99 (Supplementary
Fig. [119]5), and their respective classification accuracies are 85.7%,
90.8%, and 97.0% (Fig. [120]2j). Most importantly, our model has good
performance for detecting stage-I lung adenocarcinoma with an
accuracy of 96.1%, indicating its potential as an effective method for
early lung cancer screening.
Insensitivity to batch effects
Batch effect is one of the most common error sources for the analysis
of metabolomic data. To evaluate the impact of batch effects on the
non-targeted LC-MS data, we first generated 3 biological replicates as
reference samples for each of the 10 batches. These reference
replicates were taken from a mixture of 100 healthy human serum
samples, and each of them contains equal amounts of isotope-labelled
internal standards, including ^13C-lactate, ^13C[3]-pyruvate,
^13C-methionine, and ^13C[6]-isoleucine
(see Methods). The differences in the data structure of reference
replicates from different batches are visualised in 3D and 2D
illustrations (Fig. [121]3a and Supplementary Fig. [122]6), which
indicate changes in peak shape and area as well as RT shifts among
different batches. Comparison of individual isotopic peaks in samples
from different batches also shows that the batch effects mainly take
the form of RT shifts and differences in peak shape and area for
the same metabolites (Fig. [123]3b).
Fig. 3. Explainable deep learning method avoids limitations of batch effects
in conventional methods.
a Batch effects in 3D point array and 2D mapped heatmap of reference
samples. RT: retention time; m/z: mass-to-charge ratio. b Isotope peaks
of the same concentration in different samples. Different colours
represent the batches to which the samples belong. c The visualisation
of dimensionality reduction of normalisation by the Reference Material
method. Below: different colours represent different classes; Above:
different colours indicate different batches. Healthy: healthy
individuals; Benign: benign lung nodules; Malignant: lung
adenocarcinoma. d The visualisation of dimensionality reduction for the
output data of the hidden layers in DeepMSProfiler. Conv1 to Conv5 are
the outputs of the first to the fifth pooling layer in the feature
extraction module. Block4 and Block5 are the outputs of the fourth and
fifth conv layers in the fifth feature extraction module. Upper:
different colours indicate different sample batches; Lower: different
colours represent different population classes. e Correlation of the
output data of the hidden layer with the batch and class information in
DeepMSProfiler. The horizontal axis represents the layer names. Conv1
to Conv5 are the outputs of the first to the fifth pooling layer in the
feature extraction module. Block10 and Block16 are the outputs of the
tenth and sixteenth conv layers in the fifth feature extraction module.
The blue line represents the batch-related correlations, and the orange
line illustrates the classification-related correlations. f The
accuracy rates of traditional methods (blue), corrected methods based
on reference samples (purple), and DeepMSProfiler (red) in independent
testing dataset (n = 50). The boxplot shows the minimum, first
quartile, median, third quartile, and maximum values, with outliers
shown as individual points.
We then investigated the batch effect corrections and compared the
performance between DeepMSProfiler and conventional correction methods
(see Methods). As shown in Fig. [126]3c, after correction by the
Reference Material (Ref-M) method^[127]34, we still observed 3 clusters
in the principal component analysis (PCA) profiles, which represent the
sample dots from three different hospitals. While Ref-M effectively
addresses the batch effect within samples from the same hospital, the
residual variation across hospitals remains (Fig. [128]3c). Samples of
batches 1–6 and 9–10 were obtained from the Sun Yat-Sen University Cancer
Centre, and samples of batch 8 came from the First Affiliated Hospital
of Sun Yat-Sen University. Samples of batch 7 were a mixture from three
different hospitals. Among them, lung cancer samples came from the
Affiliated Cancer Hospital of Zhengzhou University, lung nodule samples
came from the First Affiliated Hospital of Sun Yat-Sen University, and
healthy samples came from the Sun Yat-Sen University Cancer Centre. The
100 healthy human serum samples used as references during the conventional
procedures were all from the Sun Yat-Sen University Cancer Centre,
which might be the main reason why it is difficult to correct the batch
effects for batches 7–9. To illustrate DeepMSProfiler’s end-to-end
process in the automatic removal of batch effects, we extracted the
output of the hidden layer and visualised the flow of data during the
forward propagation of the network (see Methods). From the input layer
to the output layer, the similarity between different batches becomes
progressively higher, while the similarity between different types
becomes progressively lower (Fig. [129]3d and Supplementary
Fig. [130]7). DenseNet121 is a deep neural network with 431 layers, of
which 120 are convolutional layers (Supplementary Data [131]1). The
fourth and fifth layers here refer to the outputs of the fourth and
fifth densely connected modules, which are separated by 112 layers.
Figure [132]3d and Supplementary Fig. [133]7 illustrate the intermediate
changes from the output of the fourth densely connected module to the
output of the fifth densely connected module. As the batch effects are
removed, the classification becomes clearer. We further quantified this process of
change using different metrics that measure the correlation between the
PCA clusters and the given labels, i.e., the K-nearest neighbour batch
effect test score^[134]35, local inverse Simpson’s index^[135]36,
adjusted Rand index (ARI), normalised mutual information (NMI), and
average silhouette coefficient (ASC) (see Methods). The closer to the
output layer, the less relevant the data is to batch labels and the
more relevant to class labels (Fig. [136]3e). This explains how
batch effect removal is achieved progressively through the hidden layers
(Fig. [137]3d). This capability might be gained through supervised
learning. Our findings suggest that in the forward propagation process,
the DeepMSProfiler model excludes batch-related information from the
input data layer by layer, while retaining class-related information.
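A hedged sketch of how such layer-wise correlations could be quantified with scikit-learn, assuming `layer_outputs` is a flattened (n_samples, n_features) activation matrix extracted from one hidden layer and `labels` holds either batch or class annotations; here ARI and NMI compare k-means clusters of a PCA embedding against the labels, and the silhouette coefficient is computed on the labels directly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score

def label_correlation(layer_outputs, labels, n_components=2):
    """Score how strongly a hidden layer's activations are organised by the given labels."""
    emb = PCA(n_components=n_components).fit_transform(layer_outputs)
    clusters = KMeans(n_clusters=len(np.unique(labels)),
                      n_init=10, random_state=0).fit_predict(emb)
    return {"ARI": adjusted_rand_score(labels, clusters),
            "NMI": normalized_mutual_info_score(labels, clusters),
            "ASC": silhouette_score(emb, labels)}

# For each hidden layer, batch-label correlation is expected to fall and
# class-label correlation to rise as the data approach the output layer:
# batch_scores = label_correlation(layer_outputs, batch_labels)
# class_scores = label_correlation(layer_outputs, class_labels)
```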
Further, we compared the performance of our deep learning method
against machine learning methods with and without batch effect
correction. The correction of batch effects could improve accuracies
when using machine learning classifiers such as SVM, RF, AdaBoost, and
XGBoost. However, DeepMSProfiler, without any additional manipulation,
surpassed the machine learning methods with or without batch effect
correction in terms of prediction accuracy (Fig. [138]3f).
The impact of unknown mass spectrometric signals
To investigate the impact of unknown metabolite signals on
classification prediction, we first performed the conventional analysis
with peak extraction and metabolite annotation using existing databases
such as Human Metabolome Database (HMDB) and Kyoto Encyclopedia of
Genes and Genomes (KEGG) (see Methods). We found that 83.5% of all
detected features remain as unknown metabolites (Fig. [139]4a). The
absence of these unknown metabolites undermines the prediction accuracy
(Fig. [140]4b), indicating that this large number of unknown metabolites may
have a significant impact on classification performance. One of the
advantages of our approach over traditional methods is the ability to
retain complete metabolomic features including the unknown metabolites.
Fig. 4. The unknown space in databases and previous studies.
a Statistics of annotated metabolite peaks. Blue colour represents all
peaks, while orange, purple, and yellow colours indicate metabolites
annotated in HMDB, KEGG, and all databases, respectively. The overlap
between orange and purple includes 414 metabolites annotated in both
HMDB and KEGG. HMDB: Human Metabolome Database; KEGG: Kyoto
Encyclopedia of Genes and Genomes. b The feature selection plot
illustrates the effect of different contribution score thresholds
when unknown metabolites are removed versus retained. The horizontal axis
represents the change in threshold, while the vertical axis shows the
accuracy of the model using the remaining features. The shadings of
solid lines (mean) represent error bars (standard deviation). c
Collection criteria for published lung cancer serum metabolic biomarkers.
SCLC: Small Cell Lung Cancer; LUSC: Lung Squamous Cell Carcinoma; LUAD:
Lung Adenocarcinoma; NMR: Nuclear Magnetic Resonance; MS: Mass
Spectrometry. d The numbers of known biomarkers published in the
current literature. e Molecular weight distribution plot of known
biomarkers. f Accuracy comparison between the ablation experiment and
DeepMSProfiler (n = 50). The boxplot shows the minimum, first quartile,
median, third quartile, and maximum values. In the ablation experiment,
we investigated the effect of varying the publication count (PC) of
known biomarkers in the literature. Specifically, based on the m/z of
known biomarkers, we eliminated metabolic signals from the raw data that
were not reported in the literature. We retained only the metabolic signals
with publication counts greater than 1, greater than 3, and greater
than 8 for model development. All ablated data were analysed using
the same DeepMSProfiler architecture as the original unprocessed
data. The vertical axis shows the accuracy of
models built on the dataset of different publication counts and our
DeepMSProfiler.
We then tested the limitation of lung cancer prediction using
biomarkers identified by annotated metabolites. We collected 826
biomarkers for lung cancer based on serum mass-spectrometry analysis
from 49 publications (Fig. [143]4c). We deduced the molecular weight of
the biomarkers from the HMDB database and the information in these
publications (see Methods). Only 42.7% of the biomarkers discovered
based on traditional methods appear in more than two articles, indicating
suboptimal reproducibility (Fig. [144]4d). In addition, the
molecular weights of these metabolites are mainly distributed in the
range of 200–400 Da (Fig. [145]4e). We found that the prediction
performance of DeepMSProfiler using complete raw data is substantially
higher than that obtained using only the m/z signals corresponding to
the reported biomarkers (Fig. [146]4f). This indicates that there
are still unknown lung cancer-related metabolomic signals in the serum
samples that have not been unveiled in the current research. In
contrast, DeepMSProfiler derives the classifications directly from the
complete signals in the raw data.
Explainability to uncover the black box
After the model construction based on deep learning, we sought to
explain the classification basis of the black box model and identify
the key signals for specific classifications. We adopted a
perturbation-based method, Randomised Input Sampling for Explanation
(RISE)^[147]37, to compute feature contributions. We slightly modified
RISE to improve its operational speed and efficiency, and developed a
method to evaluate the importance of RISE scores in different
classifications (see Methods).
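A minimal sketch of this RISE-style scoring, assuming `model` returns class probabilities for a preprocessed square input matrix; the mask resolution, number of masks, and keep probability are illustrative values, and masks are applied one at a time for clarity rather than speed.

```python
import numpy as np

def rise_scores(model, x, n_masks=2000, grid=16, p_keep=0.5, seed=0):
    """Perturbation-based contribution map: weight random masks by the predicted probabilities."""
    rng = np.random.default_rng(seed)
    h, w = x.shape[:2]
    saliency = np.zeros((h, w, model.output_shape[-1]), dtype=np.float32)
    for _ in range(n_masks):
        # Low-resolution random binary mask, upsampled to the input size by repetition.
        coarse = (rng.random((grid, grid)) < p_keep).astype(np.float32)
        mask = np.kron(coarse, np.ones((h // grid, w // grid), dtype=np.float32))
        probs = model.predict(x[None, ..., None] * mask[None, ..., None], verbose=0)[0]
        saliency += mask[..., None] * probs[None, None, :]   # one score map per class
    return saliency / (n_masks * p_keep)   # (H, W, n_classes) contribution scores
```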
Interestingly, we found a “background category” phenomenon in some of
the single models (Fig. [148]5a). For each class in a single
tri-classification model of DenseNet121, the classification performance
gradually deteriorates as the metabolic signals with higher
contribution scores are removed. However, there is one category that
remains unaffected regardless of which features involved in the classification
decision are removed, while maintaining a very high number of true positives and
false positives. These findings imply that the tri-classification model
only predicts the probabilities of two of the categories, and
calculates the probability of the third category from the results of
the other two. In other words, the classification-related metabolites
associated with the “foreground category” negatively contribute to the
“background category”. Furthermore, the categories used as “background
category” are not consistent across the different models. We were
intrigued to test whether this phenomenon occurs exclusively in
metabolomic data, so we conducted a seven-classification task using the
Photo-Art-Cartoon-Sketch (PACS) image dataset^[149]38. We observed a
similar phenomenon in the resultant feature scoring of different models
(Supplementary Fig. [150]8). This suggests that a “background category”
may generally exist in multi-classification tasks performed by single models,
although its underlying mechanism is currently unclear and may require
future investigation.
Fig. 5. Feature constructions of prediction and significant metabolic signals
relevant to biological pathways.
a Prediction performance and feature scoring by different single
models. b Prediction performance and feature scoring by DeepMSProfiler.
c Heatmap matrices of classification contribution in healthy
individuals (Healthy), benign lung nodules (Benign), and lung
adenocarcinoma (Malignant). The horizontal and vertical axes of the
matrix are the prediction label and the true label, respectively. The
heatmaps of upper left, the middle one, and the bottom right represent
the true healthy individuals, the true benign nodules, and the true
lung adenocarcinoma, respectively. The horizontal and vertical axes of
each heatmap are RT and m/z, respectively. The classification
contributions of metabolites corresponding to true healthy individuals
(d), benign nodules (e), and lung adenocarcinoma (f). The horizontal
axis represents the retention time and the vertical axis represents the
m/z of the corresponding metabolites. The colours represent the
contribution score of the metabolites. The redder the colour, the
greater the contribution to the classification. Metabolites-proteins
network for healthy individuals (g), benign nodules (h), and lung
adenocarcinoma (i). Pathway enrichment analysis using the signalling
networks for healthy individuals (j), benign nodules (k), and lung
adenocarcinoma (l) (FDR < 0.05). m/z: mass charge ratio, RT: retention
time.
In contrast, the phenomenon of “background category” no longer exists
when the feature contributions are calculated by our ensemble model. As
shown in Fig. [153]5b, when we progressively eliminated metabolic
signals in each category according to their contribution, the
performance of the ensemble model decreased accordingly. Each
individual model captures different features that contribute to their
corresponding classifications, while the ensemble model could combine
these features to improve the accuracy of disease prediction and reduce
the possibility of overfitting.
Metabolomic profiles in lung adenocarcinoma, benign nodules, and healthy
individuals
To analyse the global metabolic differences between lung
adenocarcinoma, benign nodules, and healthy individuals, we extracted
the heatmaps of feature contributions counted by RISE from
DeepMSProfiler (Fig. [154]5c). As shown in Fig. [155]5d–f, the
horizontal and vertical axes of the heatmaps represent retention time and
m/z, respectively. By mapping the label information to the
heatmaps, we are able to locate the metabolites corresponding to
different m/z and retention times to obtain their feature contribution
scores. In true-positive healthy and benign nodule samples, the
metabolic signals with the most significant contribution are uniformly
located between 200 and 400 m/z and in 1–3 min (Fig. [156]5d, e). In
comparison, the metabolic signals located between 200 and 600 m/z and
in 1–4 min contribute most in lung adenocarcinoma samples, but signals
in other regions also have relatively high scores (Fig. [157]5f).
As higher contribution scores in the heatmaps represent more important
correlations, we screened signals with scores above 0.70 and attempted
to identify the corresponding metabolic profiles in each classification
(Supplementary Data [158]2). As observed in Fig. [159]5b, retaining only
metabolic signals with a contribution score above 0.7 keeps the overall
accuracy at around 0.8, which maintains effective
classification. Considering the RT shift among different
batches, we matched metabolic peaks only by m/z. We then fed these m/z
signals, together with metabolites identified by tandem mass
spectrometry (MS2), into the analysis tool PIUMet based on
protein–protein and protein-metabolite interactions^[160]39 to build
disease-associated feature networks (Fig. [161]5g–i). As the network
shown in Fig. [162]5i, 82 proteins and 121 metabolites are matched in
the lung adenocarcinoma samples, including 9 already identified by MS2
and 111 hidden metabolites found by the correlation between key
metabolic peaks. As such, the current analysis based on protein-protein
and protein-metabolite interactions allows the discovery of unknown
metabolic signals associated with diseased states, although the
resolution of the current model might be relatively low in
distinguishing all individual peaks contributing to the disease
classification. To explore the biological explainability,
among the features extracted by PIUMet, we also selected 11 metabolites
(Supplementary Table [163]5) with available authentic standards in our
laboratory to verify their presence in the lung cancer serum samples.
Indeed, these metabolites could be identified in the lung cancer serum
as described in our previous study^[164]40. We further analysed the
metabolic networks to explore the biological relevance associated with
each classification. The heatmaps (Fig. [165]5d, e) and pathway
analysis (Fig. [166]5j, k) consistently show that healthy individuals
and benign nodules share similar metabolic profiles. In contrast, the
cancer group presents a distinct profile with specific pathways and
increased counts of metabolites or proteins in the pathways shared
with healthy individuals or benign nodules (Fig. [167]5f, l). The
detailed metabolite and protein candidates are further shown in
Supplementary Data [168]3. Taken together, our network and pathway
analyses demonstrated the interpretability of DeepMSProfiler based on
deep learning.
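A simple sketch of the screening step described above, assuming `heatmap` is a contribution matrix whose rows follow the RT grid and columns follow the m/z grid used for the input matrices; the 0.70 threshold matches the text, while the axis grids and orientation are assumptions.

```python
import numpy as np

def select_key_signals(heatmap, threshold=0.70,
                       rt_start=0.5, rt_step=0.016, mz_start=50.0, mz_step=1.0):
    """Locate signals whose contribution score exceeds the threshold and return their RT and m/z."""
    rt_idx, mz_idx = np.where(heatmap > threshold)
    rt = rt_start + rt_idx * rt_step
    mz = mz_start + mz_idx * mz_step
    # Because RT shifts across batches, downstream matching (e.g. input to PIUMet)
    # relies on the m/z values only.
    return rt, mz, np.unique(mz)
```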
Application of model in colon cancer
To assess the transferability of DeepMSProfiler, we obtained a public
colon cancer LC-MS dataset containing 236 samples from
MetaboLights (ID: MTBLS1129), comprising 197 colon cancer samples and
39 healthy human samples. We randomly divided this
dataset into a discovery dataset and an independent testing dataset at a 4:1
ratio. The discovery dataset contained 157 colon cancer samples and 31
healthy control samples. The independent testing dataset contained 40
colon cancer samples and 8 healthy control samples.
Due to the differences in cancer types and mass spectrometry analysis
procedures between the colon cancer dataset and the lung adenocarcinoma
dataset, we re-trained the DeepMSProfiler model. The colorectal cancer
data was randomly divided into a discovery dataset and an independent
testing dataset, and the discovery dataset was further randomly divided
into a training dataset and a validation dataset multiple times.
In the independent testing dataset of the colon cancer dataset, our
model achieved an accuracy of 97.9% (95% CI, 97.7%–98.1%), a precision
of 98.7% (95% CI, 98.6%–98.8%), a recall of 93.4% (95% CI,
92.9%–94.1%), and an F1 of 95.8% (95% CI, 95.4%–96.2%) (Supplementary
Fig. [169]9). These results suggest an excellent transferability of
DeepMSProfiler.
Discovery of metabolic-protein networks in pan-cancer
In a continued effort to investigate the capabilities of DeepMSProfiler
in analysing metabolomics data across multiple cancer types, raw lipid
metabolomic data of 928 cell lines spanning 23 cancer types were
collected from the Cancer Cell Line Encyclopaedia (CCLE)
database^[170]2 and then subjected to processing by DeepMSProfiler.
Notably, in addition to the raw metabolomic data, these cell lines also
contain valuable data of annotated metabolites, methylation, copy
number variations, and mutations^[171]2. DeepMSProfiler constructed a
model encompassing the 23 distinct categories, followed by a feature
extraction from the 23-category model to identify the respective
crucial metabolic signals of each category. Due to the limited number
of samples for many cancer types, particularly for biliary tract,
pleura, prostate, and salivary gland cancers, each with fewer than 10
samples, we did not set aside a separate independent testing dataset for
performance validation. Twenty sub-models were trained, and in each
sub-model training run, 80% of all samples were randomly allocated for
training to ensure that every sample could contribute to the training
process, especially for cancer types with very few samples. The final
ensemble model used for explainable analysis achieved 99.3% accuracy,
97.2% sensitivity, and 100% specificity. Next, the priority-collecting
Steiner forest optimisation algorithm^[172]39 was employed to unveil
the correlation between pivotal metabolic signals and proteins using
databases of HMDB^[173]41, Recon2^[174]42 and iRefIndex^[175]43 (see
Methods).
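A minimal sketch of that sampling scheme, assuming sub-model subsets are drawn independently so that, with high probability, every sample enters the training set of at least one sub-model; `build_sub_model` refers to the hypothetical constructor sketched earlier, and the epoch count is illustrative.

```python
import numpy as np

def train_sampled_ensemble(X, y, build_sub_model, n_models=20, frac=0.8, epochs=50):
    """Train each sub-model on an independent random 80% subset of all samples."""
    rng = np.random.default_rng(0)
    n = len(X)
    sub_models = []
    for _ in range(n_models):
        idx = rng.choice(n, size=int(frac * n), replace=False)  # random 80% without replacement
        model = build_sub_model()
        model.fit(X[idx], y[idx], epochs=epochs, verbose=0)
        sub_models.append(model)
    return sub_models
```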
As a result, we successfully generated disease-specific
metabolite-protein networks (Fig. [176]6a–c) along with a contribution
score heatmap (Fig. [177]6e), where contribution scores exceeding 0.70
were considered indicative of disease-specific metabolites. Metabolites
identified within the metabolite-protein network were directly inferred
from the mass-to-charge ratio (m/z) of metabolic signals from the raw
data using feature spectra extracted by the DeepMSProfiler model.
Notably, we identified 14 metabolites and 3 proteins that exhibited
co-occurrence within the 23 cancer-related metabolite-protein networks
(Fig. [178]6d). Finally, we correlated the metabolic data with the
methylation information and subsequently verified the associations
between the PLA and UGT gene families and the disease-specific
metabolites with high contributions (Fig. [179]6f). Previous
studies^[180]44–[181]48 have reported the important roles of PLA and
UGT gene families in a variety of diseases, such as PLA2G7 and PLA2G6
in breast and prostate cancers and neurodegenerative diseases, as well
as UGT3A2 in head and neck cancers. These evidences support our
findings by DeepMSProfiler. In summary, our extended analysis spanning
pan-cancer scenarios highlights the capability of DeepMSProfiler in the
discovery of potential disease-associated metabolites and proteins.
Fig. 6. Metabolite and protein associations across 23 cancer types.
Metabolite-protein networks for (a) lung cancer, (b) gastric cancer,
and (c) leukaemia. Yellow squares: metabolites. Red circles: proteins.
Blue labels: metabolites and proteins shared in 23 cancer
metabolite-protein networks. d Metabolites and proteins shared in the
metabolite-protein networks of 23 cancer types. e Heatmap of the
classification contribution of different lipid metabolites across 23
cancer types. f Correlation of important pan-cancer-related metabolites
with methylation of the PLA and UGT gene families.
Discussion
Metabolomics faces challenges in precision medicine due to complex
analytical processes, metabolic diversity, and database
limitations^[184]5,[185]6. DeepMSProfiler starts with raw untargeted
metabolomic data and retains essential information, enabling more
effective global analysis. It offers an alternative approach by
directly processing raw data of metabolomic samples, bypassing
time-consuming experiments such as quality control or reference sample
preparation and subsequent normalisation analysis.
In metabolomic study, systematic variations in the measured metabolite
profiles may occur during sample collection, processing, analysis, or
even in different batches of reagents or instruments. Batch effects can
significantly impact the interpretation of the results, leading to
inconsistencies in replicating findings across different
studies^[186]15. While batch effects can manifest as variations in
retention time (RT) offset, peak area, and peak shape, conventional
quantitative methods often prioritise peak area integration while
overlooking peak shape^[187]49,[188]50. Significantly, our results
demonstrate that DeepMSProfiler is able to automatically eliminate
cross-hospital variations during the end-to-end forward propagation
process (Fig. [189]3d), effectively revealing classification profiles.
Moreover, DeepMSProfiler can address the challenges of unidentified
metabolites. LC-MS metabolomics can reveal tens of thousands of
metabolite peaks in a biological sample. A substantial number of these
peaks remains unidentified or unannotated in existing databases. In
this study, we demonstrated that among all detected peaks, only 16.5%
are identified by HMDB and KEGG, and this large
proportion of unknown metabolites has a considerable influence on the
accuracy of classification (Fig. [190]4b). Indeed, annotating
metabolomic peaks has remained a major study focus in the
field^[191]16. A common approach involves comparing the exact mass of
detected peaks with authenticated standards, along with either the
retention time or the fragmentation spectra obtained through tandem
mass spectrometry (MS2). Despite significant development of molecular
structural databases and MS2 spectral databases, their current
capabilities and coverage remain limited^[192]51. In addition, network
analysis, which examines complex peak relationships and clusters, has
also been developed to facilitate the comprehensive identification of
metabolites^[193]52. In this study, we employed the deep learning
method to capture original signals in LC-MS metabolomic analysis
without compromising data integrity. We further implemented a direct
transition from m/z to pathway annotations by taking advantage of the
network-based analysis tool PIUMet^[194]39, effectively identifying 82
proteins and 121 metabolites in the cancer group, compared with 9
metabolites annotated by MS2.
Furthermore, our method is able to cover the metabolites identified by
conventional annotation and simultaneously uncover previously undetected
disease-specific features. In traditional metabolomic analysis,
biomarkers specific to the disease of interest are usually sought by
comparison of metabolite levels between control and case samples.
Therefore, peak alignment and metabolite annotation are crucial to the
end results. Here, by employing the end-to-end strategy, we unveiled
the complete biological outputs that contribute to the distinct
metabolomic profiles of each group. For example, tryptophan metabolism
was identified among the characteristics of lung adenocarcinoma profile
(Fig. [195]5l). The result was consistent with our previous discovery
by the conventional annotation method that metabolites in the
tryptophan pathway were decreased in the early-stage lung
adenocarcinoma compared with benign nodules and healthy
controls^[196]40. Serine and glycine are also important for nucleotide
synthesis by mediating one-carbon metabolism, which is relevant to
therapeutic strategy targeting non-small cell lung
cancer^[197]53–[198]56. Intriguingly, we also observed the contribution
of bile secretion in the lung adenocarcinoma profile (Fig. [199]5l),
which aligns with another report of aberrant bile acid metabolism in
invasive lung adenocarcinoma^[200]57. However, it should be noted that
the resolution of our model may be too limited to distinguish all
individual peaks contributing to the disease classification.
We additionally demonstrated that among deep learning models, ensemble
models are more stable and class-balanced than single models. Although
we have not fully comprehended the reason for the occurrence of
“background category”, the ensemble strategy has effectively mitigated
this phenomenon (Fig. [201]5a, b). Our investigation on the PACS image
dataset suggests that “background category” may generally exist in
multi-classification tasks using single models. Understanding its
underlying mechanism requires further investigation with a broader
range of datasets.
The high-resolution heatmaps generated by DeepMSProfiler display the
feature contributions to the predicted classes and the precise location
of specific metabolomic signals (Fig. [202]5c), providing explainable
analysis to assure the researchers of the biological soundness of the
prediction. With the capability of batch effect removal, comprehensive
metabolomic profiling, and ensemble strategy, DeepMSProfiler
demonstrates consistent and robust performance across different
categories. It achieves AUCs over 0.99 for the predictions of lung
adenocarcinoma, benign nodules, and healthy samples, and an accuracy of
96.1% for early-stage (stage-I) lung adenocarcinoma. Moreover, its
extended pan-cancer analysis illustrates its ability to uncover
potential disease-associated metabolites and proteins beyond lung
cancer. In conclusion, our DeepMSProfiler offers a straightforward and
reliable method suitable for applications in disease diagnosis and
mechanism discovery, potentially advancing the use of metabolomics in
precision medicine. Its effective end-to-end strategy applied to raw
metabolomic data can benefit a broader population in non-invasive
clinical practices for disease screening and diagnosis.
Methods
Clinical sample collection
This study was approved by the Ethics Committees of the Sun Yat-Sen
University Cancer Centre, the First Affiliated Hospital of Sun Yat-Sen
University and the Affiliated Cancer Hospital of Zhengzhou University.
A total of 210 healthy individuals, 323 patients with benign nodules
and 326 patients with lung adenocarcinoma were enrolled. Cases of lung
adenocarcinoma were collected prior to lung resection surgery and had
pathological confirmation. Serum from benign nodules was collected from
individuals undergoing annual physical examinations. Participants with
benign nodules were defined as those with stable computed tomography (CT)
scans over 3–5 years of follow-up at the time of analysis. The sample
collection period was from January 2018 to October 2022. The sex of the
participants was determined by self-report. Informed consent was
obtained from all participants. The study design and conduct complied
with all relevant regulations regarding the use of human study
participants, and the study was conducted in accordance with the criteria
set by the Declaration of Helsinki.
In addition, we collected serum samples from 100 healthy blood donors,
including 50 males and 50 females, aged between 40 and 55 years, from
the Department of Cancer Prevention and Medical Examination, Sun
Yat-Sen University Cancer Centre. All these samples were mixed in equal
amounts and the resultant mixture was aliquoted and stored. These
mixtures were used as reference samples for quality control and data
normalisation in the conventional metabolomic analysis as previously
described^[203]34.
Serum metabolite extraction
Fasting blood samples were collected in serum separation tubes without
the addition of anticoagulants, allowed to clot for 1 h at room
temperature, and then centrifuged at 2851 × g for 10 min at 4 °C to
collect the serum supernatant. The serum was aliquoted and then frozen
at −80 °C until metabolite extraction.
Reference serum and study samples were thawed and a combined extraction
method (methyl tert-butyl ether/methanol/water) was used to extract
metabolites. Briefly, 50 μL of serum was mixed with 225 μL of ice-cold
methanol and 750 μL of ice-cold methyl tert-butyl ether (MTBE). The
mixture was vortexed and incubated for 1 h on ice. Then 188 μL MS grade
water containing internal standards (^13C-lactate, ^13C[3]-pyruvate,
^13C-methionine and ^13C[6]-isoleucine, all from Cambridge Isotope
Laboratories) was added and vortexed. The mixture was centrifuged at
15,000 × g for 10 min at 4 °C, and then the bottom phase was
transferred to two tubes (each containing 125 μL) for LC-MS analysis in
positive and negative modes. Finally, the samples were dried in a
high-speed vacuum concentrator.
Untargeted liquid chromatography-mass spectrometry
The dried metabolites were resuspended in 120 μL of 80% acetonitrile,
vortexed for 5 min and centrifuged at 15,000 × g for 10 min at 4 °C.
The supernatant was transferred to a glass amber vial with a micro
insert for metabolomic analysis. Untargeted metabolomic analysis was
performed on an ultra-performance liquid chromatography-high resolution
mass spectrometry (UPLC-HRMS) platform. The metabolites were separated
using the Dionex Ultimate 3000 UPLC system with an ACQUITY BEH Amide
column (2.1 × 100 mm, 1.7 μm, Waters). In positive mode, the mobile
phase comprised 95% acetonitrile (A) and 50% acetonitrile (B), both containing
10 mmol/L ammonium acetate and 0.1% formic acid. In negative mode, the mobile
phase was composed of 95% and 50% acetonitrile for phases A and B,
respectively, both containing 10 mmol/L ammonium acetate and adjusted
to pH 9. Gradient elution was performed as follows: 0–0.5 min, 2% B;
0.5–12 min, 2–50% B; 12–14 min, 50–98% B; 14–16 min, 98% B;
16–16.1 min, 98–2% B; 16.1–20 min, 2% B. The column temperature was
maintained at 40 °C, and the autosampler was set at 10 °C. The flow
rate was 0.3 mL/min, and the injection volume was 3 μL. A Q-Exactive
orbitrap mass spectrometer (Thermo Fisher Scientific) with an
electrospray ionisation (ESI) source was operated in full scan mode
coupled with ddMS2 monitoring mode for mass data acquisition. The
following mass spectrometer settings were used: spray voltage
+3.8 kV/−3.2 kV; capillary temperature 320 °C; sheath gas 40 arb;
auxiliary gas 10 arb; probe heater temperature 350 °C; scan range
70–1050 m/z; resolution 70000. Xcalibur 4.1 (Thermo Fisher Scientific)
was used for data acquisition.
In this study, all serum samples were analysed by LC-MS in 10 batches.
To assess data quality, a mixed quality control (QC) sample was
generated by pooling 10 μL of supernatant from each sample in the
batch. Six QC samples were injected at the beginning of the analytical
sequence to assess the stability of the UPLC-MS system, with additional
QC samples injected periodically throughout the batch. Serum pooled
from 100 healthy donors was used as reference material in each batch to
monitor extraction and batch effects. All untargeted metabolomic
analysis was performed at the Sun Yat-Sen University Metabolomics
Centre.
Public dataset collection
The raw dataset for pan-cancer lipid metabolomics data of CCLE was
downloaded from the Metabolomics Workbench database with accession
ST001142
([204]https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=
Study&StudyID=ST001142). There are 946 samples in total, including 23
cancer types. The quantitative lipid metabolite matrix and the DNA
methylation matrix were downloaded from the appendix of the
article^[205]2.
The LC-MS dataset of colon cancer was downloaded from the MetaboLights
database
([206]https://www.ebi.ac.uk/metabolights/editor/MTBLS1129/descriptors)
with 236 samples in total, including 197 colon cancer cases and 39
healthy controls. Due to the differences in disease samples,
classification purposes, instruments, and LC-MS parameters between
the public dataset and the private lung adenocarcinoma dataset, the
DeepMSProfiler model was re-trained on the public dataset.
Data format conversion
The raw format files of LC-MS data were converted to mzML format using
the MSConvert software. The data used to train the end-to-end model were
sampled directly from the mzML format without any further processing, so
the raw data could be used directly as input to the model. In each mzML
file, the ion intensity and mass-to-charge ratio of each ion point at each
time interval were recorded. Ion points were mapped into a 3D space by
their RT and m/z. A 2D matrix was then sampled from this 3D point array
using a max-pooling convolution kernel, with RT: 0.5 min and m/z: 50 as
the sampling starting point and RT: 0.016 min and m/z: 1 as the sampling
interval, which together set the sampling ranges of retention time and
mass-to-charge ratio. Using the sampling interval as a sliding
window, the maximum ion intensity in each interval was sampled to obtain
a two-dimensional matrix of 1024 × 1024 ion intensities.
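A hedged sketch of this rasterisation using the pyteomics mzML reader; the dictionary keys follow the standard mzML controlled vocabulary, retention times are assumed to be reported in minutes, and the grid parameters mirror the values stated above.

```python
import numpy as np
from pyteomics import mzml

def mzml_to_matrix(path, size=1024, rt_start=0.5, rt_step=0.016, mz_start=50.0, mz_step=1.0):
    """Max-pool ion intensities from an mzML file into a size x size (RT x m/z) matrix."""
    matrix = np.zeros((size, size), dtype=np.float32)
    with mzml.read(path) as reader:
        for spectrum in reader:
            rt = spectrum['scanList']['scan'][0]['scan start time']  # assumed to be in minutes
            rt_bin = int((rt - rt_start) / rt_step)
            if not 0 <= rt_bin < size:
                continue
            mz_bins = ((spectrum['m/z array'] - mz_start) / mz_step).astype(int)
            keep = (mz_bins >= 0) & (mz_bins < size)
            # Keep the maximum intensity falling into each (RT, m/z) cell.
            np.maximum.at(matrix[rt_bin], mz_bins[keep], spectrum['intensity array'][keep])
    return matrix
```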
Extraction and annotation of metabolic peaks
We used Compound Discoverer v3.1 and TraceFinder v4.0 (Thermo Fisher
Scientific) for peak alignment and extraction. These steps resulted in
a matrix containing retention time, mass-to-charge ratio and peak area
information for each metabolite. To eliminate potential batch-to-batch
variation, we used the Ref-M method to correct peak areas. This
involved dividing the peak area of each feature in the study sample by
the peak area of the reference compound from the same batch, yielding
relative abundances. We then used several data annotation tools such as
MetID, MetDNA and NetID in MZmine to annotate the metabolite features
and combined the annotation results^[207]52,[208]58–[209]60. These
analysis tools include mass spectral information from databases such as
the HMDB, MassBank and MassBank of North America
(MoNA)^[210]41,[211]61. In addition, we performed data annotation using
MS2 data based on Compound Discoverer v3.1 and finally selected
bio-compounds with MS2 matches to certified standards or compounds with
inferred annotations based on complete matches in mzCloud (score > 85)
or ChemSpider as the precise annotation results^[212]62.
Raw data visualisation
The mzML files were read using the Pyteomics package in Python, and the
records were traversed for all time points in the sampling interval^[213]63.
For each time index in the mzML files, the preset scan
configuration, scan window, ion injection time, intensity
array, and m/z array were recorded. The intensity array and m/z array were
selected to form an array of data points, with retention time,
mass-to-charge ratio, and intensity as the row names. The intensity
values were log2 processed. Then, the 3D point cloud data was
visualised using the Matplotlib Toolkits package in Python^[214]64. The
2D matrixes were obtained by down-sampling the 3D point cloud and
pooling the 3D data using median and maximum convolution kernels.
Convolution spans were RT: 0.1 min and m/z: 0.001. Heatmaps and
contours were plotted using Matplotlib. Retained time-intensity curves
were also plotted using Matplotlib with an m/z span of 0.003.
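A small illustrative sketch of the 3D point-cloud view follows; the rt, mz, and intensity arrays are assumed to have been collected from the mzML records as described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the 3D projection

def plot_point_cloud(rt, mz, intensity):
    """Scatter the ion points in (RT, m/z, log2 intensity) space."""
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(rt, mz, np.log2(np.asarray(intensity) + 1), s=1)
    ax.set_xlabel('RT (min)')
    ax.set_ylabel('m/z')
    ax.set_zlabel('log2 intensity')
    plt.show()
```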
Dataset division and assignment
The dataset of each batch was randomly divided into a training dataset
and an independent testing dataset in a ratio of 4:1. The data from the
first to the seventh batch contained 90 samples each, including 30
healthy individuals, 30 lung nodules, and 30 lung adenocarcinoma
samples. The data for the eighth and ninth batches did not contain
healthy samples. The data for the tenth batch only contained nodule
samples. To avoid the effect of class imbalance, we constrained the
training and independent testing datasets to have the same sample-type
and sex ratios. Because the samples came from patients of different
ages and sexes, and the lesion sizes of lung-nodule and
lung-adenocarcinoma patients also varied, sex, age, and lesion size
were used as additional constraints for dataset division to prevent
these attributes from compromising the validity of the model. The
difference in the distribution of sample attributes between the
training dataset and the independent testing dataset was verified by an
unpaired t-test.
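A minimal sketch of such a constrained 4:1 split is given below, assuming a per-batch sample table with 'label', 'sex', and 'age' columns (illustrative names); continuous attributes such as age and lesion size would need binning before they can be used as strata.

```python
from scipy.stats import ttest_ind
from sklearn.model_selection import train_test_split

def split_batch(samples_df, seed=0):
    """4:1 split stratified jointly on class label and sex."""
    strata = samples_df['label'].astype(str) + '_' + samples_df['sex'].astype(str)
    train_df, test_df = train_test_split(
        samples_df, test_size=0.2, stratify=strata, random_state=seed)
    # Verify that a continuous attribute (e.g. age) is balanced (unpaired t-test).
    _, p_age = ttest_ind(train_df['age'], test_df['age'])
    return train_df, test_df, p_age
```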
Deep learning model construction in detail
In this step, we aimed to construct a model to predict the class labels
for each metabolomic sample. For this, we first set X and Y as the
input and label spaces, respectively. A single end-to-end model
consisted of three parts, a dimension converter based on pool
processing, a feature detector based on the convolutional neural
networks, and a multi-layer perceptron (MLP) classifier. The input
taken directly from the raw data was extremely large and contained
substantial redundant information, so a pooling step was required to
reduce the resolution for the downstream convolution operations. The
model input was reduced by a maximum pooling layer to obtain D(X),
which then entered the feature extractor dominated by convolutional
layers to obtain F(D(X)). Convolutional neural networks have local
translation invariance and are therefore well suited to the RT offset
problem in metabolomic data. Owing to the relatively large size of the
data, more than five layers of convolutional operations were required
to reduce the dimensionality to a level that the available computing
resources could handle.
Different architectures were compared on the tuning set. The
architectures included VGGNet (VGG16, VGG19), Inception (InceptionV3),
ResNet (ResNet50), DenseNet (DenseNet121), and EfficientNet
(EfficientNetB0-B5)^[215]33,[216]65–[217]67. In addition, two optimised
variants of DenseNet121 were created to simplify the DenseNet network:
the direct-connection variant replaced the last dense layer of
DenseNet121 with a convolutional layer, and the optimised variant
replaced the last dense layer with a convolutional layer that retained
a one-hop connection.
pre-trained models were derived from ImageNet. Each architecture was
tested on the TensorFlow + Keras platform and PyTorch platform,
respectively. To reduce overfitting, we used only one linear layer for
our MLP layer. In the TensorFlow + Keras model, there was a SoftMax
activation layer before the output layer. The output of the model was
C(F(D(X))).
The positive- and negative-mode spectral data were passed through
separate convolutional layers for feature extraction, and their
features were combined before the fully connected layer; the
pre-training parameters of the two branches were shared. For a model
trained on both positive and negative spectral data, a cross-entropy
loss was used.
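The following Keras sketch illustrates this overall design (pooling-based dimension converter D, convolutional feature extractor F, single-layer MLP classifier C, and separate positive/negative branches merged before the classifier). The pool size, backbone choice, and channel handling are assumptions for illustration; this is not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_branch(name):
    inp = layers.Input(shape=(1024, 1024, 1), name=f'{name}_input')
    x = layers.MaxPooling2D(pool_size=4, name=f'{name}_pool')(inp)       # D(X)
    x = layers.Concatenate(name=f'{name}_to_rgb')([x, x, x])             # 1 -> 3 channels for ImageNet weights
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights='imagenet', input_shape=(256, 256, 3))
    backbone._name = f'{name}_densenet121'  # rename to avoid duplicate sub-model names
    x = backbone(x)                                                      # F(D(X))
    x = layers.GlobalAveragePooling2D(name=f'{name}_gap')(x)
    return inp, x

pos_in, pos_feat = build_branch('pos')   # positive-mode spectra
neg_in, neg_feat = build_branch('neg')   # negative-mode spectra
merged = layers.Concatenate(name='merge_modes')([pos_feat, neg_feat])
out = layers.Dense(3, activation='softmax', name='classifier')(merged)   # C(F(D(X)))

model = Model([pos_in, neg_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss='categorical_crossentropy', metrics=['accuracy'])
```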
Model training
20% of the discovery dataset was set aside as a tuning set, used to
guide model architecture selection, hyperparameter selection, and model
optimisation; the remaining 80% was used for model training. Sample
category balance was used as a constraint for this split. Model
architectures were evaluated by both validation performance and
operational performance: we counted the number of model parameters to
evaluate model complexity, and the average of 10 runs was used as the
runtime. Hyperas was used to preliminarily select the optimal model
hyperparameters and initialisation weights^[218]68. Although he_normal
was the optimal initialisation method, we opted for pre-training on the
ImageNet dataset because of its comparable performance and faster
execution. After narrowing the parameter search space, we used grid
search for hyperparameter tuning.
Ensemble strategy
DeepMSProfiler is an ensemble of several trained end-to-end sub-models;
the average of the classification probabilities of a sample across all
sub-models is used as its final prediction probability. The ensemble
model produces a score vector of length three over the three classes,
and the category with the maximum score is selected as the predicted
classification result.
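In code, this ensemble rule reduces to averaging the sub-model probabilities and taking the arg-max, for example (a sketch; `models` is a list of trained sub-models):

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the class-probability outputs of all sub-models and pick the arg-max class."""
    probs = np.mean([m.predict(x) for m in models], axis=0)  # shape: (n_samples, 3)
    return probs, probs.argmax(axis=1)
```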
Each end-to-end sub-model was trained on the discovery dataset. The
architecture of each sub-model was the same, but some hyperparameters
differed: two learning rates, 1e-3 and 1e-4, were used. The optimiser
was Adam with beta_1 = 0.9, beta_2 = 0.999, epsilon = 0.001, decay =
0.0, and amsgrad = False. The batch size was set to 8 and training was
run for 200 epochs. A model typically took about 2 h to train on a
GP100GL (16 GB) GPU server. Each sub-model contributed equally to the
final prediction, without specific weights. The independent testing
dataset was not used in model training or hyperparameter selection.
Machine learning models for comparison
To compare our DeepMSProfiler to other existing tools, we selected
several common traditional machine learning methods to build
tri-classification models based on the peak area data obtained from the
previous steps. These methods included Extreme Gradient Boosting
(XGBoost), RF, Adaptive Boosting (Adaboost), SVM, and DNN. The training
dataset and independent testing dataset were divided in the same way as
the deep learning algorithm, and the numbers of estimators for Adaboost
and XGBoost algorithms were the same as those of DeepMSProfiler.
XGBoost was implemented by the XGBClassifier function in the xgboost
library. Other machine learning methods were implemented using the
SciKitLearn library. SVM was adopted using the svm function with a
‘linear’ kernel, RF was implemented through the RandomForestClassifier
function, Adaboost through the AdaBoostClassifier function, and DNN
using the MLPClassifier function. The optimal hyperparameters were
obtained by the grid search method.
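A hedged sketch of these baseline classifiers is shown below; apart from the linear SVM kernel and the 20 estimators shared with the ensemble, the hyperparameters are placeholders rather than the tuned values.

```python
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

baselines = {
    'XGBoost': XGBClassifier(n_estimators=20),
    'SVM': SVC(kernel='linear', probability=True),
    'RF': RandomForestClassifier(),
    'Adaboost': AdaBoostClassifier(n_estimators=20),
    'DNN': MLPClassifier(),
}

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit each baseline on peak-area features and report test accuracy."""
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in baselines.items()}
```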
Performance metrics
We evaluated the performance of the model on the independent testing
dataset. The evaluation metrics included accuracy, precision,
sensitivity, and F1 score; micro-averaging was used for the multi-class
model. Confidence intervals were estimated using 1000 bootstrap
iterations, during which our model was an ensemble of 20 models trained
on the discovery dataset. In addition, we calculated a confusion matrix
and ROC curves with their AUCs to demonstrate the performance of the
model on the three classes of lung adenocarcinoma, benign nodules, and
healthy individuals. At sensitivities of 0.7 and 0.9, the specificity
was read from the sensitivity-specificity curve.
The sensitivity-specificity curve was interpolated using the
nearest-neighbour method.
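The bootstrap estimate can be sketched as follows (accuracy is used as an example metric; the resampling unit is the test-set sample):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a classification metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        stats.append(accuracy_score(y_true[idx], y_pred[idx]))
    low, high = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high
```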
Visualisation of “black-box”
In the end-to-end neural network prediction, the data flowed in a chain
of
$X \rightarrow D(X) \rightarrow F(D(X)) \rightarrow C(F(D(X)))$
from the input layer through the hidden layer to the output layer. In
the feature extraction layer, which is dominated by convolutional
layers, information was passed in the same chain manner. After
inputting X, we obtained the corresponding output L of different hidden
layers to open the black-box process. To observe the space of
intermediate features, PCA was used to reduce the dimensionality of L
to its principal components. The PCA result was visualised with the
Matplotlib package in Python.
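A hedged Keras sketch of this step is given below; the layer name is an illustrative assumption.

```python
import tensorflow as tf
from sklearn.decomposition import PCA

def hidden_layer_pca(model, layer_name, x, n_components=2):
    """Extract the output L of a named hidden layer and project it onto principal components."""
    extractor = tf.keras.Model(model.inputs, model.get_layer(layer_name).output)
    L = extractor.predict(x)
    L = L.reshape(len(L), -1)        # flatten any spatial dimensions
    return PCA(n_components=n_components).fit_transform(L)
```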
To evaluate the correlation of hidden layer output with batch label and
type label, respectively, we calculated NMI, ARI, and ASC using the
following formulas. L was the layer output and C was the cluster labels
used for the cluster evaluation.
$$\mathrm{MI}(L,C)=\sum_{i}^{\max L}\sum_{j}^{\max C}P(i,j)\,\log\frac{P(i,j)}{P(i)\,P(j)}\tag{1}$$

$$\mathrm{NMI}(L,C)=\frac{2\,\mathrm{MI}(L,C)}{H(L)+H(C)}\tag{2}$$
In the above equations, the mutual information (MI) is computed between
the layer output L and the label clusters C. $P(i,j)$ denotes the joint
distribution probability of i and j, and $P(i)$ and $P(j)$ denote the
marginal distribution probabilities of i and j, respectively. $H(L)$
and $H(C)$ represent the entropy of L and C, respectively. The clusters
of the output layer were obtained using the K-nearest neighbour
algorithm.
$$\mathrm{ARI}(L,C)=\frac{2\,(TP\times TN-FN\times FP)}{(TP+FN)(FN+TN)+(TP+FP)(FP+TN)}\tag{3}$$
In the above equation, TP represents the number of point pairs
belonging to the same cluster in both real and experimental cases, and
FN represents the number of point pairs belonging to the same cluster
in the real case but not in the same cluster in the experimental case.
FP represents the number of point pairs not belonging to the same
cluster in the real case but in the same cluster in the experimental
case, and TN represents the number of point pairs not belonging to the
same cluster in both the real and the experimental cases. The range of
ARI is [−1, 1]; the larger the value, the more consistent the
clustering is with the real result, and hence the better the
clustering.
$$\mathrm{ASC}(L,C)=\frac{1}{N}\sum_{i}^{N}\frac{b_i-a_i}{\max(a_i,b_i)}\tag{4}$$
The output layer was first dimensionally reduced by PCA, and the
clusters were specified by the real labels. In the above equation,
$a_i$ represents the average dissimilarity of vector i to the other
points in the same cluster, and $b_i$ represents the minimum
dissimilarity of vector i to points in other clusters.
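All three scores have standard implementations in scikit-learn; the sketch below assumes `embedding` is the PCA-reduced layer output, `labels` the real (batch or type) labels, and `clusters` the predicted cluster assignments.

```python
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score, silhouette_score)

def cluster_scores(embedding, labels, clusters):
    """NMI and ARI compare cluster assignments to labels; ASC scores label separation in the embedding."""
    return {
        'NMI': normalized_mutual_info_score(labels, clusters),
        'ARI': adjusted_rand_score(labels, clusters),
        'ASC': silhouette_score(embedding, labels),
    }
```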
Feature contributions
In previous studies of feature contributions, different lines of work
have used different methods to compute the contribution of features to
the final classification. These methods help to better understand
features and their impact on model predictions. Gradient-based methods,
such as GradCAM, calculate the gradients of the last convolutional
layer by backpropagation from the category with the highest
confidence^[219]69. Because of its convenience, this method is widely
used in computer-vision tasks, but it has a significant limitation: the
resolution of the feature-contribution heatmap is extremely low and
cannot distinguish most signal peaks. The size of the
feature-contribution heatmap corresponds to that of the last
convolutional layer of the model, and the weight of each feature
contribution is the average of the gradients of all features. On the
other hand,
perturbation-based methods, such as RISE and Local Interpretable
Model-Agnostic Explanations, measure the importance of features by
obscuring some pixels in raw data^[220]37,[221]70. The predictive
ability of the black box is then observed to show how much this affects
the prediction. Perturbation-based methods yield higher resolution and
more accurate contribution estimates, but their runtimes are longer. To
improve computing speed in this study, we made improvements based on
RISE, using boost sampling for the mask process. With this RISE-based
approach, we determined the feature contributions of RT and m/z for
each sample according to its true category. The feature-contribution
heatmap uses RT as the horizontal axis and m/z as the vertical axis to
show the contribution at different positions for each sample. The
average feature contribution of all samples correctly predicted as
their true category was taken as the feature contribution of that
category. Meanwhile, from the peak extraction performed in the previous
steps, we determined the RT range and the median m/z of each signal
peak; the feature contribution at the corresponding RT and median m/z
coordinates was then taken as the contribution of that signal peak.
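A compact sketch of this perturbation-based scoring is shown below for a single-input model; the mask count, grid size, and keep probability are placeholders, and the original RISE additionally smooths and randomly shifts the upsampled masks.

```python
import numpy as np

def rise_saliency(model, x, true_class, n_masks=500, grid=16, p_keep=0.5, seed=0):
    """x: one input of shape (1024, 1024, 1); returns a (1024, 1024) contribution map."""
    rng = np.random.default_rng(seed)
    h, w = x.shape[:2]
    scale = (h // grid, w // grid)
    saliency = np.zeros((h, w), dtype=np.float32)
    for _ in range(n_masks):
        coarse = (rng.random((grid, grid)) < p_keep).astype(np.float32)
        mask = np.kron(coarse, np.ones(scale, dtype=np.float32))  # nearest-neighbour upsampling
        score = model.predict((x * mask[..., None])[None], verbose=0)[0, true_class]
        saliency += score * mask   # weight each mask by the true-class confidence
    return saliency / n_masks
```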
Network analysis and pathway enrichment
The extracted metabolic peaks with a contribution score greater than
0.70 to the lung cancer classification were filtered. Mass-to-charge
ratio and some substance identification information of these
metabolites and their classification contribution scores were used as
input data. For some of the metabolic signal peaks, we accurately
identified their molecular formulae and substance names by secondary
mass spectrometry and used these as substance identification
information. Owing to the limitations of existing databases, many
unknown signals could not be identified by secondary mass spectrometry;
therefore, PIUMet was also adopted to search for hidden metabolites and
related proteins.
PIUMet built disease-related metabolite-protein networks based on the
prize-collecting Steiner Forest algorithm. First, PIUMet integrated
iRefIndex (v13), HMDB (v3) and Recon2 databases to obtain the
relationship between m/z, metabolites and proteins, and generated an
overall metabolite-protein network. The edges were weighted by the
confidence level of the correlation reported in these databases.
iRefIndex provides details on the physical interactions between
proteins, which are detected through different experimental methods.
The protein-metabolite relationships in Recon2 are based on their
involvement in the same reactions. HMDB includes proteins that are
involved in metabolic pathways or enzymatic reactions, as well as
metabolites that play a role in protein pathways, based on known
biochemical and molecular biological data. The disease-related
metabolite peaks obtained by DeepMSProfiler were matched to the
metabolite nodes of the overall network by their m/z, and directly to
the terminal metabolite nodes of the overall network after annotation.
The corresponding feature contributions obtained by DeepMSProfiler
served as prizes for these metabolite nodes. The network structure was
then optimised using the prize-collecting Steiner Forest algorithm to
minimise the total network cost and connect all terminal nodes, thereby
removing low-confidence relationships and obtaining disease-related
metabolite sub-networks.
Metabolite identification is an important issue in metabolomics
research and there are different levels of confidence in
identification. Referring to the highest confidence level
considered^[222]71, we analysed authentic chemical standards and
validated 11 of the metabolites that PIUMet had discovered from m/z
alone (Supplementary Table [223]5). Then, disease-related metabolites
and proteins were used
to analyse their pathways^[224]39. These hidden metabolites and
proteins from PIUMet were then processed for KEGG pathway enrichment
analysis using MetaboAnalyst (v6.0). We used the joint pathway analysis
in MetaboAnalyst, choosing the hypergeometric test for enrichment
analysis and degree centrality as the topology measure. The integrated
version of the KEGG pathways (year 2019) was adopted in MetaboAnalyst.
Pathways were filtered using a p-value cut-off of 1e-5^[225]72. The
corresponding
SYMBOL IDs of the proteins were converted to KEGG IDs by the
ClusterProfiler package in R^[226]73.
Ablation experiment
We searched the PubMed database for a total of 5088 articles using the
terms “serum”, “lung cancer” and “metabolism” from 2010 to 2022. By
reading their titles and abstracts, we excluded publications that used
non-serum samples such as urine and tissue, as well as publications
that used non-mass-spectrometry methods such as chromatography, nuclear
magnetic resonance, and infrared spectroscopy. We then further screened
the selected literature to exclude studies that did not report the
discovery of metabolic biomarkers. Finally, 49 publications remained,
reporting 811 serum metabolic biomarkers for lung cancer. Some of these
publications provided the retention time and mass-to-charge ratio of
the biomarkers, whereas others gave only the name of the identified
biomarker; we therefore looked up the molecular weights of these
metabolites in the HMDB database to match the corresponding m/z.
Matching m/z through metabolite molecular weights took full account of
the effect of adducts. Based on the number of publications reporting
each biomarker, we retained the m/z values corresponding to biomarkers
reported in more publications than a given threshold. We filtered the
signals in the raw data to exclude signals that did not fall into the 3
ppm intervals around these m/z. The filtered raw data were used as
input to the model.
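The 3 ppm filter can be sketched as follows; `biomarker_mzs` stands for the set of literature-derived m/z values described above.

```python
import numpy as np

def ppm_filter(mz_array, intensity_array, biomarker_mzs, ppm=3.0):
    """Zero out all signals whose m/z is not within +/- ppm of any biomarker m/z."""
    mz_array = np.asarray(mz_array)
    keep = np.zeros(len(mz_array), dtype=bool)
    for target in biomarker_mzs:
        tolerance = target * ppm * 1e-6
        keep |= np.abs(mz_array - target) <= tolerance
    filtered = np.asarray(intensity_array, dtype=float).copy()
    filtered[~keep] = 0.0
    return filtered
```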
Statistical analysis
All statistical calculations were performed using the stats package in
Python, and data distributions were plotted using the Seaborn package
in Python. The correlation between patient information and labels was
calculated using Pearson’s, Spearman’s, and Kendall’s correlation
coefficients: Pearson’s correlation coefficient was preferred for
detecting linear correlations, while Spearman’s and Kendall’s rank
correlation coefficients were used to detect non-linear correlations.
P-values below 0.05 were considered significant.
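For reference, the three correlation coefficients can be computed with SciPy as follows (a sketch; x and y stand for a patient attribute and the label encoding):

```python
from scipy import stats

def correlations(x, y):
    """Pearson (linear), Spearman and Kendall (rank-based) correlation coefficients."""
    pearson_r, _ = stats.pearsonr(x, y)
    spearman_r, _ = stats.spearmanr(x, y)
    kendall_t, _ = stats.kendalltau(x, y)
    return {'pearson': pearson_r, 'spearman': spearman_r, 'kendall': kendall_t}
```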
Figure preparation
The main figures in this paper were assembled in Adobe Illustrator. The
photograph of the mass spectrometry instruments was taken of the actual
equipment. The data-structure diagrams were generated by fitting
simulated functions in Python. Some cartoon components were drawn
through
FigDraw ([227]www.figdraw.com) with a license code (TAIRAff614) for
free use.
AI-assisted technologies in the writing process
At the end of the preparation of this work, the authors used ChatGPT to
proofread the manuscript. After using this tool, the authors reviewed
and edited the content as needed and take full responsibility for the
content of the publication.
Reporting summary
Further information on research design is available in the [228]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[229]Supplementary Information^ (4.4MB, pdf)
[230]Reporting Summary^ (299.2KB, pdf)
[231]Supplementary Data 1^ (29.6KB, xlsx)
[232]Supplementary Data 2^ (50.2KB, xlsx)
[233]Supplementary Data 3^ (35.1KB, xlsx)
Source data
[234]Source data^ (5MB, xlsx)
[235]Transparent Peer Review file^ (101.3KB, pdf)
Acknowledgements