Abstract
Background
Plant secondary metabolites are highly valued for their applications in
pharmaceuticals, nutrition, flavors, and aesthetics. It is of great
importance to elucidate plant secondary metabolic pathways due to their
crucial roles in biological processes during plant growth and
development. However, understanding plant biosynthesis and degradation
pathways remains a challenge due to the lack of sufficient information
in current databases. To address this issue, we propose a transfer
learning approach using a pre-trained hybrid deep learning architecture
that combines Graph Transformer and convolutional neural network (GTC)
to predict plant metabolic pathways.
Results
GTC provides comprehensive molecular representation by extracting both
structural features from the molecular graph and textual information
from the SMILES string. GTC is pre-trained on the KEGG datasets to
acquire general features, followed by fine-tuning on plant-derived
datasets. Four metrics were chosen for model performance evaluation.
The results show that GTC outperforms six other models, including three
previously reported machine learning models, on the KEGG dataset. GTC
yields an accuracy of 96.75%, precision of 85.14%, recall of 83.03%,
and F1_score of 84.06%. Furthermore, an ablation study confirms the
indispensability of all the components of the hybrid GTC model.
Transfer learning is then employed to leverage the shared knowledge
acquired from the KEGG metabolic pathways. As a result, the transferred
GTC exhibits outstanding accuracy in predicting plant secondary
metabolic pathways with an average accuracy of 98.30% in fivefold
cross-validation and 97.82% on the final test. In addition, GTC is
employed to classify natural products. It achieves a perfect accuracy
of 100.00% for alkaloids, while the lowest accuracy, 98.42%, is
obtained for shikimates and phenylpropanoids.
Conclusions
The proposed GTC effectively captures molecular features, and achieves
high performance in classifying KEGG metabolic pathways and predicting
plant secondary metabolic pathways via transfer learning. Furthermore,
GTC demonstrates its generalization ability by accurately classifying
natural products. A user-friendly executable program has been
developed, which only requires the input of the SMILES string of the
query compound in a graphical interface.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12859-023-05485-9.
Keywords: Metabolic pathway prediction, Plant secondary metabolism,
Deep learning, Transfer learning, Graph Transformer
Background
Plant secondary metabolites are metabolic intermediates and products,
which are considered to be non-essential for the growth and survival of
the organism. However, they not only participate in environmental
responses like stress resistance [1, 2] and disease resistance [3, 4],
but are also active ingredients of many herbal medicines [5]. Research
on plant secondary metabolism can guide the
cultivation of excellent crops with desirable traits and contribute to
the discovery of novel drugs for human diseases.
Metabolomics is a powerful tool for exploring the production and
functions of secondary metabolites [6]. Through differential
metabolite analysis, changes in metabolic pathways provide valuable
insights into underlying biological mechanisms, such as pathogenic
disease [7] and crop selection [8]. Pathway enrichment analysis based
on metabolomics data may be biased, primarily because current knowledge
of plant secondary metabolic pathways is limited [9] and certain
pathways have not been included in existing databases owing to outdated
content and maintenance requirements [10]. Exploring these pathways
through biological experiments is
time-consuming and resource-intensive. Therefore, it is crucial to
develop new ways of improving the understanding of plant secondary
metabolism.
Several machine learning algorithms have been reported to predict the
metabolic pathways of a compound, including traditional machine
learning algorithms and deep learning models. In early research,
traditional algorithms such as the Nearest Neighbor Algorithm [11]
and Adaboost [12] were adopted to classify the compounds into a
single pathway category, ignoring the possibility that a compound might
belong to multiple metabolic pathways. Later, multi-target models were
proposed based on chemical interaction [13, 14] to rank the predicted
metabolic pathway classes of the query compound. A bioinformatics tool,
TrackSM [15], was developed to extract scaffolds from compounds in the
same metabolic pathways. It assumed that scaffolds (substructures) are
associated with corresponding metabolic pathways. Besides, the Support
Vector Machine (SVM), a powerful algorithm, has been widely used in
many fields, including protein structural class prediction [16]. In the
context of metabolic pathway prediction, several studies have tackled
the challenge of multi-label classification by converting it into a
binary classification task [17–20].
Recently, deep learning models have been applied to predict metabolic
pathways. The first deep learning model adopted in predicting metabolic
pathways was a Graph Convolutional Network (GCN) [21]. This
prediction engine only requires the simplified molecular-input
line-entry system (SMILES) [22] string of the query compound and
outperformed traditional machine learning algorithms such as the Random
Forest (RF) model. Afterward, the Graph Attentional Network (GAT) was
reported to achieve better performance [23, 24]. However, most
prediction results are limited to the 11 pathway classes in the Kyoto
Encyclopedia of Genes and Genomes (KEGG) [25], which are not
sufficiently fine-grained. Although one study classified compounds into
sub-classes of metabolic pathways [26], it introduced a large number of
negative samples. Additionally, no graphical user interface is
available for users.
Machine learning and deep learning models have demonstrated their
potential in predicting metabolic pathways, which can greatly improve
our understanding of plant secondary metabolism. However, these models
require further refinement and expansion of their applications. Ongoing
developments in Graph Neural Networks (GNN) and Convolutional Neural
Networks (CNN) are expected to contribute to this progress. Moreover,
transfer learning can effectively improve the efficiency of model
construction and alleviate the problem of insufficient data. In this
paper, we propose a transfer learning approach using a pre-trained
hybrid deep learning architecture that combines a graph transformer
[27] and a CNN
(GTC) to predict plant metabolic pathways. Overall, the primary
contributions of this article are as follows:
A deep ensemble learning model The architecture of our proposed model,
GTC, comprises two distinct blocks. A GNN-based block captures the
structural features of the molecule by treating it as a graph, and a
CNN-based block learns from the text information of the SMILES string.
By combining these two representations, GTC provides a more
comprehensive representation of molecules, enabling it to outperform
state-of-the-art methods in predicting metabolic pathways.
Transfer strategy of fine-tuning To solve the problem of insufficient
data on plant secondary metabolic pathways, transfer learning is
implemented. After the pre-trained model achieves high performance on
the KEGG dataset, GTC is fine-tuned to accurately predict plant
secondary metabolic pathways.
Model generalization The modular architecture of GTC allows for easy
adaptation to various molecular prediction tasks beyond metabolic
pathway classification. The model can be adapted to predict other
molecular properties, such as classifying natural products, by simply
modifying the output layer accordingly.
Methods
Datasets
To implement a transfer learning process, separate datasets are
required for the target task and the source task. In this study, four
datasets were used (Additional file 2: Table S1). Both Dataset A
(Additional file 2: Table S2) and Dataset B (Additional file 2: Table
S3) were created for the target task, while Dataset C (Additional file
2: Table S4) served as a benchmark dataset for the source task.
Besides, Dataset D (Additional file 2: Table S5) was generated to
evaluate the generalization ability of the GTC model. The detailed
descriptions are as follows:
Dataset A houses 2028 plant-derived compounds, represented by SMILES
strings and labeled with 18 pathway labels. The information was
collected from the Plant Metabolic Network (PMN) [28] in October 2022.
The 18 sub-classes are labeled as ‘0’, ‘1’, …, ‘17’ and encompass
secondary metabolite biosynthesis and degradation pathways, including
aroma compound biosynthesis, fatty acid derivative biosynthesis, … and
terpenoid degradation (Additional file 2: Table S6). Since a single
compound may belong to several metabolic pathways (Additional file 1:
Fig. S1 and Fig. S2), this is a multi-label task. Besides, the
unbalanced distribution of the 2028 compounds is shown in Additional
file 2: Table S6. In the presence of an unbalanced distribution, a deep
learning model trained from scratch may overly focus on learning the
majority classes, leading to increased errors when predicting the
minority classes. To address this challenge, a transfer learning
approach was adopted to leverage knowledge from a related dataset
through pre-training. Our model benefits from pre-trained
representations, ultimately resulting in improved performance on
minority classes.
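Because a compound may carry several of the 18 pathway labels at once, each target is naturally encoded as a multi-hot vector. The following minimal Python sketch illustrates this encoding; the pathway indices in the example are hypothetical and not taken from Dataset A.

```python
import numpy as np

N_PATHWAYS = 18  # sub-classes labeled '0' .. '17' in Dataset A

def to_multi_hot(pathway_ids, n_classes=N_PATHWAYS):
    """Convert a set of pathway indices into a multi-hot label vector."""
    y = np.zeros(n_classes, dtype=np.float32)
    y[list(pathway_ids)] = 1.0
    return y

# Hypothetical compound annotated with pathways 3 and 17
print(to_multi_hot({3, 17}))
```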
Dataset B consists of 577 plant-derived compounds, which are classified
into the same 18 pathway categories as Dataset A. The dataset was
extracted from the updated PMN database in February 2023 and retains
only the newly added compounds absent from Dataset A. It should be
noted that Dataset B is excluded from the training process and serves
only for the final test evaluation. Dataset C comprises 5288 compounds
and 11 pathway classes in the KEGG database, downloaded in December
2022. Similarly, the compounds are also represented by SMILES strings.
The 11 classes of metabolic pathways include carbohydrate metabolism,
energy metabolism, lipid metabolism, and so on (Additional file 2:
Table S7). Dataset C has great similarity in task and domain with
Dataset A, as both datasets involve chemical structures as input and
labels related to metabolic pathways. Additionally, Dataset C has been
carefully annotated by domain experts, which ensures data reliability.
Additional file 1: Fig. S3 and Additional file 2: Table S7 show that
Dataset C provides enough data instances. Therefore, it is reasonable
to consider Dataset C as the source data for transfer learning.
Besides, this dataset from the KEGG database has been widely used in
previous studies for the prediction of metabolic pathways.
Dataset D contains 78,305 natural products, which are labeled with 7
pathway categories. It was originally collected by the researchers of
NPClassifier [29], which was reported for the classification of natural
products using counted Morgan fingerprints as input. The primary data
with 78,336 compounds were downloaded from
https://github.com/mwang87/NP-Classifier in March 2023. Deduplication
was then performed to create Dataset D, which was used to test the
generalization ability of the GTC model.
Model construction
The deep learning model, GTC, is an ensemble of a graph transformer
network and a CNN. Figure 1 shows the overall framework of GTC as
well as the pipeline of the whole transfer learning process, which is
further explained in the section Transfer learning. The model comprises
two separate blocks that capture molecular features from different
perspectives. The first block focuses on extracting graph-based
molecular structural features, while the second block learns text-based
molecular structural information. By combining these two
representations, the artificial neural network can comprehensively and
effectively learn molecular structures. The integrated features are
then passed through an output layer for prediction.
Fig. 1.
Flowchart of the proposed GTC. The whole pipeline contains two
processes: pre-training and fine-tuning. The model is initially
constructed on the KEGG dataset and then leverages the learned
knowledge to predict plant secondary metabolic pathways. Conv1D stands
for one-dimensional convolutional layer; Pool1D stands for
one-dimensional max-pooling layer; FC stands for fully connected layer
Block1 In this block, a molecule is treated as a graph composed of
nodes and edges. Here, each atom is a node, and each chemical bond is
an edge. Each atom i is connected to its neighbors j. Specifically, the
node features are given as X = {x[1], x[2], x[3], …, x[n]}. Partially
drawing on existing research [24, 30], the node feature x[i] is
initialized with the combination of the atom symbol, the number of
adjacent atoms, the number of adjacent hydrogens, the formal charge of
the atom, the implicit valence of the atom, the chiral type of the
atom, whether the atom is in a ring, and whether the atom is in an
aromatic structure, using RDKit [31]. In this way, the ith node feature
x[i] and the jth node feature x[j] are obtained. Corresponding to the
self-attention mechanism in the well-known Transformer [32], there are
three important elements in the graph transformer: Query, Key, and
Value, which are calculated by:
$$\mathrm{Query} = W_{c,q}\, x_i,$$
$$\mathrm{Key} = W_{c,k}\, x_j,$$
$$\mathrm{Value} = W_{c,v}\, x_j,$$
where W[c,q], W[c,k], and W[c,v] are three weight matrices.
Basically, in a Graph Transformer layer, the node feature is updated by
aggregating the features of its neighboring atoms. With multi-head
attention adopted, for the cth attention head, x[i] is updated by the
following propagation rule:
$$x'_{c,i} = W_{c,l}\, x_i + \sum_{j \in N(i)} \alpha_{c,i,j}\, W_{c,v}\, x_j,$$
where W[c,l] is another weight matrix and α[c,i,j] is the attention
coefficient, which is computed by:
$$\alpha_{c,i,j} = \mathrm{softmax}\!\left(\frac{(W_{c,q}\, x_i)^{T}\,(W_{c,k}\, x_j)}{\sqrt{d}}\right),$$
where d is the dimension of x[i]. After being updated by each attention
head, the node feature is concatenated by:
$$x'_i = \big\Vert_{c=1}^{C}\; x'_{c,i},$$
where C is the total number of attention heads and ‖ denotes vector
concatenation. Subsequently, two pooling methods, global mean pooling
and global max pooling, are applied in parallel in the Global Pooling
layer. Finally, through a fully-connected layer, a molecule is
represented by a 2000-dimensional vector.
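For concreteness, the following sketch shows one way Block1 could be assembled with RDKit and PyTorch Geometric's TransformerConv layer. It is an illustrative approximation under stated assumptions, not the authors' implementation: the numeric (rather than one-hot) atom encoding, the hidden sizes, the number of layers, and the number of heads are all assumptions.

```python
import torch
from torch import nn
from rdkit import Chem
from torch_geometric.data import Data
from torch_geometric.nn import TransformerConv, global_mean_pool, global_max_pool

def atom_features(atom):
    # Simplified numeric encoding of the eight properties listed above;
    # the paper presumably one-hot encodes most of them.
    return [atom.GetAtomicNum(), atom.GetDegree(), atom.GetTotalNumHs(),
            atom.GetFormalCharge(), atom.GetImplicitValence(),
            int(atom.GetChiralTag()), int(atom.IsInRing()),
            int(atom.GetIsAromatic())]

def smiles_to_graph(smiles):
    mol = Chem.MolFromSmiles(smiles)
    x = torch.tensor([atom_features(a) for a in mol.GetAtoms()], dtype=torch.float)
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [(i, j), (j, i)]                       # both directions of each bond
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return Data(x=x, edge_index=edge_index)

class Block1(nn.Module):
    """Graph transformer block: layer count and widths are assumptions."""
    def __init__(self, in_dim=8, hidden=128, heads=4, out_dim=2000):
        super().__init__()
        self.conv1 = TransformerConv(in_dim, hidden, heads=heads)        # output: hidden * heads
        self.conv2 = TransformerConv(hidden * heads, hidden, heads=heads)
        self.fc = nn.Linear(2 * hidden * heads, out_dim)                  # mean + max pooling concatenated

    def forward(self, x, edge_index, batch):
        x = torch.relu(self.conv1(x, edge_index))
        x = torch.relu(self.conv2(x, edge_index))
        pooled = torch.cat([global_mean_pool(x, batch),
                            global_max_pool(x, batch)], dim=1)
        return self.fc(pooled)                                            # 2000-dimensional molecule vector

g = smiles_to_graph("CCO")                                # ethanol as a toy input
batch = torch.zeros(g.num_nodes, dtype=torch.long)        # single-molecule "batch"
print(Block1()(g.x, g.edge_index, batch).shape)           # torch.Size([1, 2000])
```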
Block2 In this block, the SMILES string of a molecule is input in text
form. The neural network of MolecularTransformerEmbeddings [33] is used
to obtain the molecular representation vector, in which every character
has its specific attention weights. Therefore, this vector encodes the
structural characteristics of the molecule. Subsequently, the vector
passes through a one-dimensional convolution layer and a
one-dimensional pooling layer. Finally, similar to Block1, a molecule
is represented by a 2000-dimensional vector through a fully-connected
layer.
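A comparable sketch of Block2 follows, operating on a pre-computed per-character embedding matrix such as the one produced by MolecularTransformerEmbeddings. The embedding width of 512, the padded length of 256, the channel count, and the kernel size are assumptions for illustration only.

```python
import torch
from torch import nn

class Block2(nn.Module):
    """CNN over a SMILES embedding matrix (assumed shape: max_len x emb_dim)."""
    def __init__(self, emb_dim=512, max_len=256, channels=64, out_dim=2000):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1)  # Conv1D
        self.pool = nn.MaxPool1d(kernel_size=2)                              # Pool1D
        self.fc = nn.Linear(channels * (max_len // 2), out_dim)              # FC

    def forward(self, emb):                      # emb: (batch, max_len, emb_dim)
        x = emb.transpose(1, 2)                  # -> (batch, emb_dim, max_len)
        x = torch.relu(self.conv(x))
        x = self.pool(x)                         # -> (batch, channels, max_len // 2)
        return self.fc(x.flatten(1))             # 2000-dimensional molecule vector

# Toy usage with random numbers standing in for the real character embeddings
emb = torch.randn(1, 256, 512)
print(Block2()(emb).shape)                       # torch.Size([1, 2000])
```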
Extracted from the two blocks, the features of a molecule are
concatenated into a 4000-dimensional vector. This information is
learned by a CNN akin to Block2, and a dropout layer is employed to
prevent over-fitting. As the output layer, the last fully connected
layer (Last FC) provides a set of probabilities that the compound
belongs to each metabolic pathway. The final prediction is made using a
threshold value of 0.5. Given a bit-string such as ‘0001…1000’, a ‘1’
at the pth position indicates that the metabolite belongs to the pth
metabolic pathway, while a ‘0’ indicates that it does not belong to
that pathway.
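The fusion and output stage can be summarized as below; this is a minimal sketch in which a plain hidden layer stands in for the CNN used in the paper, and the hidden width, dropout rate, and class count are assumptions. With sigmoid outputs and a 0.5 threshold, the prediction is exactly the bit-string described above.

```python
import torch
from torch import nn

class FusionHead(nn.Module):
    """Concatenate the two 2000-d block outputs, then dropout and the last FC."""
    def __init__(self, n_pathways=11, dropout=0.3):
        super().__init__()
        self.hidden = nn.Linear(4000, 512)        # stands in for the paper's CNN layers
        self.drop = nn.Dropout(dropout)           # guards against over-fitting
        self.last_fc = nn.Linear(512, n_pathways)

    def forward(self, v1, v2):
        x = torch.cat([v1, v2], dim=1)            # 4000-dimensional joint feature
        x = self.drop(torch.relu(self.hidden(x)))
        return torch.sigmoid(self.last_fc(x))     # per-pathway probabilities

head = FusionHead()
probs = head(torch.randn(1, 2000), torch.randn(1, 2000))
bit_string = (probs > 0.5).int()                  # multi-label prediction at threshold 0.5
print(bit_string)
```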
To obtain a pre-trained model for subsequent transfer learning, several
variant models were built for performance comparison. Different ways of
passing messages between nodes in graphs affect the performance of
GNNs. Therefore, three variants were created by replacing the GNN layer
in Block1 while keeping the small convolution kernel in Block2
unchanged: GCN combined with CNN (GCN + CNN), GAT combined with CNN
(GAT + CNN), and SuperGAT [34] combined with CNN (SuperGAT + CNN). It
should be noted that common GCN and GAT have been previously utilized
in the prediction of metabolic pathways, whereas SuperGAT has not been
applied to this task before. Additionally, a fivefold cross-validation
was conducted. The model performance is evaluated by accuracy,
precision, recall, and F1_score. Here, we choose the micro F1_score.
Additional file 1 provides an exhaustive description and discussion of
these evaluation metrics, including implementation specifics.
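As a hedged illustration of how such multi-label scores might be computed (the paper's exact accuracy definition is given in Additional file 1; element-wise accuracy is one plausible reading), consider:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy multi-hot labels and predictions for 4 compounds over 11 KEGG pathway classes
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(4, 11))
y_pred = rng.integers(0, 2, size=(4, 11))

acc  = accuracy_score(y_true.ravel(), y_pred.ravel())              # element-wise accuracy
prec = precision_score(y_true, y_pred, average="micro", zero_division=0)
rec  = recall_score(y_true, y_pred, average="micro", zero_division=0)
f1   = f1_score(y_true, y_pred, average="micro", zero_division=0)  # micro F1_score
print(f"acc={acc:.4f} precision={prec:.4f} recall={rec:.4f} F1={f1:.4f}")
```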
Transfer learning
GTC undergoes an initial pre-training process on Dataset C to acquire
knowledge, which is subsequently transferred to predict plant secondary
metabolic pathways (Fig. 1). During the fivefold cross-validation,
the model with the highest accuracy is chosen. As described in
Fig. 2, there are three common strategies for transfer learning. In
the first pattern, all the pre-trained weights are frozen and only the
last fully connected layer (output layer) is changed. The other two
patterns use fine-tuning, which involves updating a deployed learning
model with a smaller learning rate. In the second one, some layers are
fine-tuned and the others are frozen. In the third strategy, the entire
model is fine-tuned with the pre-trained weights. Based on these three
patterns, five concrete transfer strategies have been implemented as
follows:
* Entire model frozen: the pre-trained weights were frozen in the
entire model.
* Block1 frozen: the pre-trained weights were frozen in Block1.
* Block2 frozen: the pre-trained weights were frozen in Block2.
* Block1 + Block2 frozen: the pre-trained weights were frozen in both
Block1 and Block2.
* No module frozen: fine-tune the entire model with the pre-trained
weights.
Fig. 2.
Three common strategies utilized in transfer learning. a All the
pre-trained weights are frozen and only the last fully connected layer
(output layer) is changed. b Some layers are fine-tuned and the others
are frozen. c The entire model is fine-tuned with the pre-trained
weights
The learning rate is reduced from 0.0003 (default) to 0.0001 in the
five transfer strategies. In addition, for the three strategies Block1
frozen, Block2 frozen, and Block1 + Block2 frozen, the pre-trained
weights in the fully connected layer are not frozen.
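In code, the “Block2 frozen” strategy amounts to disabling gradients for Block2, replacing the output layer for the 18 plant pathway classes, and lowering the learning rate. The sketch below assumes the pre-trained GTC exposes attributes named block2 and last_fc; the actual attribute names in the authors' code may differ.

```python
import torch

def apply_block2_frozen(model, n_plant_pathways=18, lr=1e-4):
    """Prepare a pre-trained GTC-like model for fine-tuning with Block2 frozen."""
    for param in model.block2.parameters():
        param.requires_grad = False                                   # keep pre-trained SMILES-CNN weights
    in_features = model.last_fc.in_features
    model.last_fc = torch.nn.Linear(in_features, n_plant_pathways)    # new output layer: 11 -> 18 classes
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)                         # reduced learning rate (0.0001)
```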
Results and discussion
Performance comparison of several machine learning models on the benchmark
dataset
Four deep learning models are constructed, including GTC and its three
variant models: GCN + CNN, GAT + CNN, and SuperGAT + CNN. Their
performances are compared with previously reported methods. Among the
related research, only two studies have made their codes publicly
available. One study [21] proposed RF-based and GCN-based models
for metabolic pathway prediction. The other study developed MLGL-MP
[24], which utilized both GAT and GCN for the compound encoder and
GCN for the pathway encoder. We use their source codes to retrain the
three aforementioned models. Fivefold cross-validation experiments are
conducted for all seven models.
As seen in Table 1, GTC outperforms the other models, yielding an
accuracy of 96.75%, precision of 85.14%, recall of 83.03%, and F1_score
of 84.06%. A two-tailed Student's t-test is performed to compare GTC
with each of the other six models. The results show that the four
scores of GTC are significantly better than those of the three
previously published methods, except for precision compared with
MLGL-MP. As for our models, significant differences exist between GTC
and GCN + CNN. However, no significant differences are observed between
GTC and GAT + CNN or between GTC and SuperGAT + CNN. It should be noted
that both GAT + CNN and SuperGAT + CNN use the attention mechanism in
their graph neural networks, which implies that the attention mechanism
plays a crucial role in feature extraction and aggregation in the
graph. The attention coefficients in the graph transformer appear to be
computed more effectively, particularly for our task.
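The comparison can be reproduced along the following lines, here with hypothetical per-fold F1 scores and an independent two-sample test; a paired test (scipy.stats.ttest_rel) would also be reasonable if the folds are matched.

```python
from scipy import stats

# Hypothetical per-fold F1 scores (%) from fivefold cross-validation
gtc_f1   = [84.9, 83.1, 84.5, 83.8, 84.0]
other_f1 = [80.9, 79.6, 80.5, 80.1, 79.6]

t_stat, p_value = stats.ttest_ind(gtc_f1, other_f1)   # two-tailed by default
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```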
Table 1.
Performance of GTC and other machine learning models
Methods          Accuracy (%)       Precision (%)      Recall (%)         F1_score (%)
RF-based         95.66 ± 0.16***    70.95 ± 0.89***    69.38 ± 0.84***    70.16 ± 0.86***
GCN-based        95.94 ± 0.14***    80.81 ± 0.36***    79.47 ± 1.20**     80.14 ± 0.71***
MLGL-MP          96.45 ± 0.15*      84.67 ± 1.36       80.10 ± 0.78**     82.32 ± 0.60*
GCN + CNN        96.05 ± 0.14***    82.26 ± 0.66**     78.67 ± 1.53**     80.41 ± 0.75***
GAT + CNN        96.51 ± 0.27       83.99 ± 1.73       81.74 ± 0.73       82.84 ± 1.10
SuperGAT + CNN   96.63 ± 0.24       84.30 ± 1.46       82.75 ± 0.81       83.52 ± 1.06
GTC              96.75 ± 0.21       85.14 ± 1.42       83.03 ± 0.92       84.06 ± 0.85
The values represent the mean ± standard deviation obtained through
fivefold cross-validation. The best value for each metric is
highlighted in bold. The symbols *, **, and *** indicate that the
performance of GTC is significantly better in the t-test at p-values
less than 0.05, 0.01, and 0.001, respectively.
Ablation studies
For an ensemble learning model, ablation studies are essential for
assessing the contribution of each component to the whole system. To
evaluate the effectiveness of the components of GTC, ablation studies
were conducted on three variants of the proposed model. The first
variant removed the CNN layers after the concatenation of the vectors
learned by Block1 and Block2. The second and third variants lacked
Block2 and Block1, respectively.
As seen in Fig. 3, the intact GTC model achieves the best performance
across all four evaluation metrics. Each of the three components
contributes differently to the learning model. The removal of the CNN
layers after the concatenation of the two vectors does not have a
significant impact: since features are already extracted through Block1
and Block2, a fully connected layer alone can still produce a decent
output. A similar performance is observed when Block2 is removed.
However, removing Block1 results in drastic underperformance. These
results indicate that the graph transformer network plays a crucial
role in our model. In general, all components contribute to the
prediction task to varying degrees, so none of them should be
eliminated.
Fig. 3.
The effects on model performance of Block1, Block2, and the CNN layers
applied after the concatenation of the vectors learned by Block1 and
Block2
Performance of transfer learning
To evaluate the effectiveness of transfer learning, we perform
fivefold cross-validation experiments on Dataset A, using the five
strategies described in the section Transfer learning, Methods.
Freezing the pre-trained weights in Block2 (referred to as “Block2
frozen”) yields an accuracy of 98.30%, precision of 88.44%, recall of
84.36%, and F1_score of 86.34% (Table 2). All four metrics achieve
their best values, suggesting that “Block2 frozen” is the most
effective strategy for our task. Among the five transfer strategies,
fine-tuning the entire model and fine-tuning the model with Block2
frozen perform almost identically on the four evaluation metrics. The
input of Block2 is the embedding produced by the neural network
MolecularTransformerEmbeddings [33]. After pre-training on Dataset C,
Block2 has already been adjusted to the task of multi-label prediction.
Retraining Block2 is therefore unnecessary, as it adds computational
cost and training time without substantial gains in performance. A
similar conclusion is drawn from comparing the freezing of Block1 with
the freezing of both Block1 and Block2. However, freezing Block1
adversely affects model performance, which is consistent with the
results of the ablation study. Given the core role of Block1,
fine-tuning Block1 is recommended to adapt it to the new prediction
task. It should also be noted that freezing the entire model degrades
model performance abruptly, possibly because the chemical structures
and label distributions differ drastically between Dataset C and
Dataset A. Therefore, the strategy of fine-tuning is the better option.
In summary, fine-tuning the transferred model with the pre-trained
weights in Block2 frozen is an appropriate strategy for predicting
plant secondary metabolic pathways.
Table 2.
Performance of different transfer strategies for plant secondary
metabolic pathway prediction
Strategy                 Accuracy (%)   Precision (%)   Recall (%)     F1_score (%)
Entire model frozen      97.04 ± 0.18   80.23 ± 0.99    70.90 ± 2.17   75.26 ± 1.44
Block1 frozen            98.16 ± 0.20   86.99 ± 2.29    83.49 ± 1.10   85.20 ± 1.53
Block2 frozen            98.30 ± 0.15   88.44 ± 1.59    84.36 ± 1.17   86.34 ± 1.07
Block1 + Block2 frozen   98.15 ± 0.19   86.99 ± 2.26    83.45 ± 0.94   85.17 ± 1.40
No module frozen         98.29 ± 0.15   88.23 ± 1.55    84.32 ± 1.31   86.22 ± 1.11
The values represent the mean ± standard deviation obtained through
fivefold cross-validation. The best value on the metric is highlighted
in bold
Then, the same strategy is implemented to conduct the final test on
Dataset B, which is not involved in model training. All 2028 molecules
in Dataset A are used as the training set. After 300 epochs, the
prediction model is saved and used to obtain the prediction results.
Moreover, we investigate the influence of data augmentation, an
alternative method often used to address data insufficiency and
imbalance. Here, it is hypothesized that two molecules with high
structural similarity are involved in the same metabolic pathways. Data
augmentation methods α and β involve the replacement of methoxy groups
with hydroxy groups and the replacement of hydroxy groups with methoxy
groups [29]; α applies the substitution once, while β applies it three
times. Data augmentation methods γ and θ leverage molecular fingerprint
similarity, calculating the Tanimoto coefficient between the 2028
molecules in Dataset A and the molecules in the COCONUT database [35].
γ and θ select molecules with fingerprint similarity between 0.99–1 and
0.96–1, respectively. The value of 0.96 represents the average Tanimoto
coefficient in the datasets created by α and β. As described in Table
3, the transfer learning approach exhibits decent performance. The
accuracy is 97.82%, while precision, recall, and F1_score are all
83.74%. In contrast, the impact of the four data augmentation methods
is negligible. However, it is noteworthy that the recall value of
84.32% indicates a decrease in the number of false negatives when the
number of training instances reaches 52,530. Thus, considering both
model performance and computational cost, transfer learning outperforms
data augmentation in predicting plant secondary metabolic pathways.
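The similarity filter behind methods γ and θ can be sketched with RDKit as follows; the Morgan fingerprint radius of 2 and bit size of 2048 are assumptions rather than the paper's settings.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Tanimoto coefficient between Morgan fingerprints of two molecules."""
    fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), radius, nBits=n_bits)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

# Keep a candidate COCONUT molecule only if it is highly similar (strategy gamma: 0.99-1)
if tanimoto("COc1ccccc1O", "Oc1ccccc1O") >= 0.99:
    print("add candidate to the augmented training set")
```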
Table 3.
Performance comparison between transfer learning and several data
augmentation methods for plant secondary metabolic pathway prediction
Methods               Compounds   Accuracy (%)   Precision (%)   Recall (%)   F1_score (%)
Transfer learning     2028        97.82          83.74           83.74        83.74
Data augmentation α   7666        97.80          83.10           84.17        83.63
Data augmentation β   52,530      97.70          81.84           84.32        83.06
Data augmentation γ   9846        97.50          81.02           81.73        81.37
Data augmentation θ   17,927      97.54          81.13           82.30        81.71
[99]Open in a new tab
Application in natural product classification
GTC has been demonstrated to be effective in extracting molecular
features. We also apply it to other prediction tasks to assess its
generalization ability. Here, GTC is employed to classify natural
products, utilizing Dataset D in the fivefold cross-validation
experiments. Since the data size is sufficient, transfer learning is
not employed. The entire model is trained from scratch.
The metrics in Table 4 demonstrate that GTC can effectively classify
the natural products into their related pathways in NPClassifier. The
highest accuracy score is 100.00% for alkaloids, and the lowest is
98.42% for shikimates and phenylpropanoids. The results imply that GTC
can extract molecular features for various applications, such as the
creation of novel molecular descriptors and molecular property
prediction.
Table 4.
Performance of GTC in natural product classification
Category                          Accuracy (%)    Precision (%)   Recall (%)      F1_score (%)
Alkaloids                         100.00 ± 0.00   100.00 ± 0.00   100.00 ± 0.00   100.00 ± 0.00
Amino acids and Peptides          98.91 ± 0.22    98.98 ± 0.32    99.80 ± 0.17    99.39 ± 0.12
Carbohydrates                     99.06 ± 0.17    99.24 ± 0.25    99.64 ± 0.26    99.44 ± 0.11
Fatty acids                       98.85 ± 0.28    98.97 ± 0.42    99.63 ± 0.16    99.30 ± 0.18
Polyketides                       98.85 ± 0.28    98.34 ± 0.62    99.48 ± 0.21    98.90 ± 0.26
Shikimates and Phenylpropanoids   98.42 ± 0.55    97.98 ± 1.33    98.38 ± 1.01    98.17 ± 0.62
Terpenoids                        99.09 ± 0.16    98.41 ± 0.95    98.55 ± 1.13    98.47 ± 0.29
The values represent the mean ± standard deviation on fivefold
cross-validation. The best value on the metric is highlighted in bold,
while the lowest value on the metric is underlined
The EXE program of plant secondary metabolic pathway prediction
To make the model more accessible, an executable (EXE) program with a
graphical user interface (GUI) has been developed as shown in
Additional file 1: Fig. S4. Users can input the SMILES string of a
query compound, and the program executes the prediction task,
displaying the results in the interface and saving them to a text file.
The detailed process is demonstrated in Additional file 1: Fig. S5.
Conclusions
In this study, we propose GTC, a deep transfer learning model that
employs a pre-trained hybrid deep learning architecture and transfer
learning for the prediction of plant secondary metabolic pathways. GTC
mainly consists of two blocks, which leverage a graph transformer-based
graph neural network and a convolutional neural network to
comprehensively learn molecular features. By combining these two
representations, GTC provides a more comprehensive representation of
molecules, leading to superior performance compared to state-of-the-art
methods in predicting metabolic pathways. GTC is pre-trained on the
KEGG dataset and then fine-tuned to complete the accurate prediction of
plant secondary metabolic pathways. The modular architecture of GTC
allows for easy adaptation to other molecular prediction tasks beyond
metabolic pathway classification, such as classifying natural products,
by modifying the output layer accordingly. This study highlights the
potential of deep learning in pathway prediction and offers an
accessible tool for predicting plant secondary metabolic pathways.
Supplementary Information
Additional file 1. Evaluation metrics, implementation details and
supplementary figures (PDF, 466.2 KB).
Additional file 2. Supplementary tables (XLSX, 3.8 MB).
Abbreviations
CNN
Convolutional neural network
EXE
Executable
FC
Fully connected
GAT
Graph attentional network
GCN
Graph convolutional network
GNN
Graph neural network
GTC
Graph transformer and CNN
GUI
Graphical user interface
KEGG
Kyoto Encyclopedia of Genes and Genomes
PMN
Plant metabolic network
RF
Random forest
SMILES
Simplified molecular-input line-entry system
SVM
Support vector machine
Author contributions
All authors participated in the design of the study. Pipeline and
framework design: H.B. and X.L. Data extraction: H.B. and J.Z. Code
writing: H.B. and J.Z. Drafting the manuscript: H.B. Revision of the
manuscript: X.L., G.X., X.Z., and C.Z. All authors read and approved
the final manuscript.
Funding
This research was supported by the key foundation (21934006) and
foundations (22074145, 22174141, and 22274151) from the National
Natural Science Foundation of China, and the Innovation Program (DICP
I202004) of Science and Research from Dalian Institute of Chemical
Physics, CAS, China, and the AI S&T Program (Grant No. DNL-YL A202202)
from Yulin Branch, Dalian National Laboratory For Clean Energy, CAS,
China.
Availability of data and materials
The EXE program of Plant Secondary Metabolic Pathway Prediction.exe is
freely available to academics at
https://github.com/ucasaccn/PSMPP.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Contributor Information
Xin Lu, Email: luxin001@dicp.ac.cn.
Guowang Xu, Email: xugw@dicp.ac.cn.
References