Abstract Objective To identify plasma lipid characteristics associated with premetabolic syndrome (pre-MetS) and metabolic syndrome (MetS) and provide biomarkers through machine learning methods. Methods Plasma lipidomics profiling was conducted using samples from healthy individuals, pre-MetS patients, and MetS patients. Orthogonal partial least squares-discriminant analysis (OPLS-DA) models were employed to identify dysregulated lipids in the comparative groups. Biomarkers were selected using support vector machine recursive feature elimination (SVM-RFE), random forest (rf), and least absolute shrinkage and selection operator (LASSO) regression, and the performance of two biomarker panels was compared across five machine learning models. Results In the OPLS-DA models, 50 and 89 lipid metabolites were associated with pre-MetS and MetS patients, respectively. Further machine learning identified two sets of plasma metabolites composed of PS(38:3), DG(16:0/18:1), and TG(16:0/14:1/22:6), TG(16:0/18:2/20:4), and TG(14:0/18:2/18:3), which were used as biomarkers for the pre-MetS and MetS discrimination models in this study. Conclusion In the initial lipidomics analysis of pre-MetS and MetS, we identified relevant lipid features primarily linked to insulin resistance in key biochemical pathways. Biomarker panels composed of lipidomics components can reflect metabolic changes across different stages of MetS, offering valuable insights for the differential diagnosis of pre-MetS and MetS. Keywords: machine learning, nontargeted lipidomics, premetabolic syndrome, metabolic syndrome, biomarkers 1. Background MetS comprises a cluster of “cardiometabolic risk” factors, including high blood sugar, hypertension, hypertriglyceridemia, low high-density lipoprotein cholesterol, and abdominal obesity. Pre-MetS denotes a set of clinical and biochemical features manifesting metabolic irregularities in specific aspects, albeit not fully meeting the diagnostic criteria for MetS ([33]1–[34]4). The combined impact of these components and ongoing metabolic disruptions significantly increase the risk of cardiovascular disease (CVD) ([35]3) and cancer ([36]4). According to previous research ([37]5), the risk of CVD in pre-MetS is 1.5 to 2.3 times higher than that in individuals without MetS components, while MetS increases the risk by 3.44 to 4.42 times. As of now, the most widely accepted diagnostic criteria for MetS include those established by the National Cholesterol Education Program Adult Treatment Panel III (NCEP ATP III) ([38]6), the International Diabetes Federation (IDF) ([39]7), and the Joint Commission of the China Adult Dyslipidemia Control Guide (JCDCG) ([40]8) in China. Among these, the IDF criteria stipulate abdominal obesity as a prerequisite, while the JCDCG criteria incorporate postprandial blood glucose into the definition of hyperglycemia. Furthermore, the revised ATP III criteria enhance screening for individuals at high risk by lowering the diagnostic threshold for fasting blood glucose to 5.6 mmol/L. Compared to other criteria, the revised ATP III criteria are more straightforward and efficient, offering advantages in capturing individuals with metabolic abnormalities in large-scale community screening. Global prevalence rates for MetS (IDF criteria) and pre-MetS (IDF criteria) were reported to be 16.46% and 14.72%, respectively ([41]9). Prior investigations indicated that the incidence of MetS stabilized after the age of 46 ([42]10), and the contribution of each metabolic factor associated with MetS was not equal ([43]5). A cross-sectional study showed that the most common risk factors for pre-MetS and MetS are hypertension and abdominal obesity ([44]11), while another small-scale study revealed a higher prevalence of high triglycerides and hypertension ([45]12). A recent cohort study assessed the relative contributions of four major MetS risk factors in a large population, ranked from highest to lowest as high blood sugar, hypertension, dyslipidemia, and obesity ([46]13). Metabolic phenotypes observed in MetS patients with hyperglycemia are similar to those with all four risk factors, indicating that individuals with hyperglycemia and hypertension are more predisposed to developing cardiovascular and cerebrovascular diseases. Many studies combined machine learning with lifestyle-related and anthropometric features to detect and prevent MetS ([47]11), yet the mechanisms underlying the development of MetS remain incompletely understood ([48]6). However, research suggests that insulin resistance, disturbances in glucose and lipid metabolism, and chronic inflammation interact through multiple signaling mechanisms, with abnormal lipid metabolism being a common denominator ([49]14, [50]15). The clustered metabolic disruptions in MetS lead to worsening lipid metabolism abnormalities, eventually culminating in significant cardiovascular disease. Thus, apart from clinical markers, lipidomics is employed to discover diagnostic and prognostic biomarkers associated with MetS, enhancing our understanding of its etiology. For instance, a Dutch study found that approximately 100 lipids, mainly triglycerides, were positively correlated with MetS, while 10 lipids were negatively correlated ([51]16). Given the escalating global prevalence of MetS, early identification of at-risk individuals and predicting patient responses to treatment is vital. The development of novel biomarkers for MetS has potential for use in diagnosis and treatment of this disorder. Researchers have extensively screened population and clinical features for predicting MetS ([52]17) and identifying related factors ([53]18). However, no study has deeply investigated changes in lipid metabolites across different physiological states of pre-MetS and MetS. Thus, gaining a deeper understanding of lipid changes could aid in establishing monitoring programs for pre-MetS and MetS, ultimately reducing the incidence of cardiovascular disease. This study aims to construct optimal pre-MetS and MetS identification models through a combination of machine learning techniques and nontargeted lipidomics, contributing to preventive health care in the population. 2. Materials and methods 2.1. Study design and participants Between March 2021 and June 2021, a multistage stratified cluster random sampling method was used to select residents undergoing routine health check-ups from 18 villages in 6 towns in Jin’an District, Fuzhou City. A preliminary survey was conducted with a response rate of 95.75%, involving 1,800 permanent residents who had lived in the area for at least 6 months. The inclusion criteria were as follows: (1) age ≥ 18 years; and (2) exclusion of individuals with coronary heart disease, myocardial infarction, angina pectoris, stroke, malignancy, chronic obstructive pulmonary disease, chronic urinary system diseases (e.g., stones, prostatitis, chronic nephritis), or missing baseline data. A total of 8,715 individuals met these criteria. 28 MetS patients were enrolled and matched 1:1 and 2:1 by sex and age with pre-MetS and normal individuals, respectively, resulting in a final study cohort of 70 participants. 2.2. Variable definitions and survey content MetS diagnosis followed the revised ATP III ([54]6), where participants were defined as having MetS if they had any three of the following five phenotypes: (1) systolic blood pressure (SBP) ≥ 130 mmHg and/or diastolic blood pressure (DBP) ≥ 85 mmHg; (2) triglycerides (TG) ≥ 1.7 mmol/L; (3) fasting plasma glucose (FPG) ≥ 5.6 mmol/L; (4) high-density lipoprotein cholesterol (HDL-C) < 1.03 mmol/L for men or < 1.29 mmol/L for women; and (5) abdominal obesity defined as waist circumference (WC) ≥ 90 cm for men or ≥ 85 cm for women. Pre-MetS was defined as having one or two MetS components. A self-designed unified questionnaire was used to collect information on personal health status, medical history, and lifestyle behaviors (exercise, smoking, alcohol consumption, sleep). Physical examinations included height, weight, waist circumference (measured twice and averaged), and blood pressure measurements (measured thrice using UR-9000F). Laboratory biochemical tests were conducted on venous blood collected from participants in a fasting state. Serum total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), HDL-C, TG, and FBG were measured using enzymatic colorimetric methods. Serum uric acid (SUA), creatinine (Cre) and blood urea nitrogen (BUN) levels were measured using a colorimetric method on a Hitachi 7100 automatic biochemistry analyzer. 2.3. Nontargeted lipidomics analysis After fasting for at least 12 hours, morning venous blood samples were collected from all participants using venipuncture, and the samples were stored at -80°C until further nontargeted lipidomics analysis. The lipidomics contents were measured at Shanghai Applied Technology Co., Ltd., China ([55]http://www.aptbiotech.com/). The project utilizes a nontargeted lipidomics analysis platform based on the UPLC-Orbitrap mass spectrometry system from China New Life Technology Co., Ltd. Lipid identification and data preprocessing are carried out using LipidSearch software by Thermo Scientific™. Preparation of quality control (QC) samples involves combining equal amounts of samples from each group to create the QC mixture. QC samples serve not only to assess instrument status and chromatography−mass spectrometry system equilibration before injection but also to evaluate the overall experimental system stability. Sample preprocessing involved thawing samples on ice, vortex-mixing, and transferring 100 μL to a 1.5 mL centrifuge tube. Subsequently, 200 μL of 4°C water was added, followed by vortex mixing. Next, 240 μL of prechilled methanol was added and mixed by vortexing, and then 800 μL of MTBE was added and mixed by vortexing. The mixture was subjected to 20 minutes of ultrasonication in a low-temperature water bath, followed by 30 minutes of room-temperature incubation. Afterward, centrifugation at 14,000 g and 10°C for 15 minutes was performed, and the upper organic phase was collected. The samples were dried using nitrogen gas and stored at -80°C. Chromatographic separation employed the UHPLC Nexera LC-30A ultrahigh-performance liquid chromatography system. The column temperature was set at 45°C, and the flow rate was 300 μL/min. The mobile phase consisted of two components: A - 10 mM ammonium formate in acetonitrile-water solution (acetonitrile:water = 6:4, v/v) and B - 10 mM ammonium formate in acetonitrile-isopropanol solution (acetonitrile:isopropanol = 1:9, v/v). The gradient elution program was as follows: 0-2 minutes, B was held at 30%; 2-25 minutes, B linearly changed from 30% to 100%; and 25-35 minutes, B was held at 30%. Throughout the analysis, samples were kept in an autosampler at 10°C. To mitigate the impact of instrument signal fluctuations, samples are analyzed in a randomized sequence. Mass spectrometric separation was conducted using both electrospray ionization positive and negative ion modes. After UHPLC separation, analysis was performed using a QExactive Plus mass spectrometer (Thermo Scientific™). 2.4. Data analysis Data were double-entered using EpiData 3.1 software, and statistical analysis was performed using SPSS 26.0 and R 4.2.2 software. For normally distributed data, the mean ± standard deviation (x̄ ± s) is used, while for non-normally distributed data, the median (upper quartile, lower quartile) is used, represented as Median (M), quartile range (P25, P75). Group differences are compared using analysis of variance (ANOVA) or non-parametric tests. Count data are presented as composition ratios and rates (n, %), and group differences are analyzed using chi-square tests. Lipid identification, peak extraction, and lipid characterization were performed using Lipid Search. Univariate analysis was conducted on the extracted data, and volcano plots were used for visualization. Prior to evaluating the predictive performance of various machine learning methods, data from each group underwent exploratory multivariate statistical analysis using seven-fold cross-validation and OPLS-DA, including normalization, logarithmic transformation, and autoscaling, to examine potential outliers or systematic variations (FDR < 0.05). The variable importance for the projection (VIP) values were used to measure the influence strength and explanatory power of each lipid molecule on sample classification discrimination in each group. Lipid molecules with VIP > 1 significantly contribute to the model interpretation. Lipid molecules with VIP > 1.5, P < 0.05, and FC > 1.5 were selected as significantly different based on the criteria. The machine learning models in this study included generalized linear model (glm), recursive partitioning and regression (rpart), random forest (rf), linear discriminant analysis (lda), and prediction analysis for microarrays (pam). Before evaluating the predictive performance of various machine learning methods, exploratory multivariate statistical data analysis using OPLS-DA was conducted on normalized, logarithmically transformed, and autoscaled data from each group to check for potential outliers or systematic changes (FDR < 0.05). The variable intersection of support vector machine recursive feature elimination (SVM-RFE), rf, and least absolute shrinkage and selection operator (LASSO) regression was applied to each pairwise comparison (control vs. pre-MetS and pre-MetS vs. MetS) to identify the most discriminative variables. After selecting variables, five machine learning models were established. Validation was performed using 7-fold cross-validation, and during model development, 10-fold cross-validation was used for training and testing to obtain optimal parameters. In the model development process, adjustments were made to the hyperparameters of each algorithm (such as cost values, kernel functions, and the number of trees in the training dataset). Therefore, using the best hyperparameters, our model was trained and tested on six folds and validated on the remaining fold, repeated seven times across the entire dataset ([56] Figure 1 ). Figure 1. [57]Figure 1 [58]Open in a new tab Study design and data analysis workflow. 3. Results 3.1. Clinical characteristics We tested 1361 nontargeted lipid metabolites in the plasma of patients with pre-MetS or MetS. The important sociodemographic factors and laboratory tests for each participant are reported in [59]Table 1 . Results from ANOVA and Chi-square tests indicated no statistically significant differences (P > 0.05) in gender, age, education level, marital status, occupation type, smoking status, alcohol consumption, exercise habits, TC, LDL-C, Cre, and BUN among the groups ([60] Table 1 ). Table 1. Baseline. Characteristics Total (n = 70) Normal (n = 14) pre-MetS(n = 28) MetS(n = 28) P Gender 1.000 Male 40 (57.1) 8 (57.1) 16 (57.1) 16 (57.1) Female 30 (42.9) 6 (42.9) 12 (42.9) 12 (42.9) Age(years) 53.61 ± 8.1 54.29 ± 10.36 52.96 ± 7.89 53.93 ± 7.29 0.856 Education 0.320 High school and above 2 (6.9) 1 (20.0) 0 (0) 1 (8.3) Below high school 27 (93.1) 4 (80.0) 12 (100) 11 (91.7) Marital status 0.799 Married/cohabiting 67 (95.7) 14 (100) 27 (96.4) 26 (92.9) Divorced/widowed/separated 3 (4.3) 0 (0) 1 (3.6) 2 (7.1) Occupations 0.337 Brain work 12 (17.1) 2 (14.3) 7 (25.0) 3 (10.7) Physical labor 43 (61.4) 11 (78.6) 14 (50.0) 18 (64.3) Retired/unemployed 15 (21.4) 1 (7.1) 7 (25.0) 7 (25.0) Smoking status 0.844 Never 48 (68.6) 10 (71.4) 18 (64.3) 20 (71.4) Smoking/quitting 22 (31.4) 4 (28.6) 10 (35.7) 8 (28.6) Alcohol consumption 0.199 No alcohol/moderate drinking 63 (90.0) 14 (100) 26 (92.9) 23 (82.1) Alcohol abuse 7 (10.0) 0 (0) 2 (7.1) 5 (17.9) Exercise habits 0.767 Medium to high intensity exercise 29 (41.4) 7 (50.0) 11 (39.3) 11 (39.3) Lack of exercise 41 (58.6) 7 (50.0) 17 (60.7) 17 (60.7) Sleep duration(h/d) 0.046 7-8 42 (60.0) 9 (64.3) 21 (75.0) 12 (42.9) <7 or >9 28 (40.0) 5 (35.7) 7 (25.0) 16 (57.1) BMI (kg/m^2) 24.62 ± 3.74 23.41 ± 3.23 23.8 ± 3.66 26.05 ± 3.7 0.029 WC (cm) 83.82 ± 9.68 76.84 ± 4.53 80.22 ± 9.4 89.79 ± 7.89 < 0.001 SBP (mmHg) 127.3 (116.4, 136.2) 119.3 (115.3, 128.4) 119 (114.6, 131.5) 136.2 (127.8, 142.4) < 0.001 DBP (mmHg) 79.77 ± 10.07 73.14 ± 8.93 76.73 ± 7.73 86.13 ± 9.32 < 0.001 FBG (mmol/L) 5.7 (5.3, 6.5) 5.3 (5.1, 5.5) 5.7 (5.2, 6.5) 6.1 (5.7, 6.7) < 0.001 OGTT-2h(mmol/L) 11.3 (6.1, 12) 6.1 (5.2, 7.2) 8.8 (5.7, 11.5) 12 (11.4, 13.3) < 0.001 TC (mmol/L) 5.04 ± 0.87 4.68 ± 0.52 4.93 ± 0.89 5.33 ± 0.92 0.052 TG (mmol/L) 1.4 (1.0, 1.9) 0.7 (0.6, 0.9) 1.2 (1, 1.6) 2.2 (1.7, 3.4) < 0.001 HDL-C (mmol/L) 1.25 ± 0.28 1.43 ± 0.21 1.27 ± 0.27 1.15 ± 0.28 0.007 LDL-C (mmol/L) 3.03 ± 0.77 3.04 ± 0.44 3.02 ± 0.93 3.03 ± 0.75 0.997 SUA (μmmol/L) 356 (302.6, 421.1) 321 (234.5, 359.2) 344.5 (300, 405) 387 (329.2, 448.5) 0.017 Cre (mmol/L) 69.65 ± 12.74 67.26 ± 16.03 68.39 ± 10.89 72.1 ± 12.71 0.412 BUN (mmol/L) 4.94 ± 1.28 4.81 ± 1.07 4.87 ± 1.19 5.08 ± 1.47 0.762 [61]Open in a new tab Data are presented as the means ± SDs or frequencies (percentages). MetS metabolic syndrome, SD standard deviation, WC waist circumference, SBP systolic blood pressure, DBP diastolic blood pressure, BMI body mass index, FPG fasting plasma glucose, OGTT-2h Oral Glucose Tolerance Test - 2 hours, TC total cholesterol, TG triglycerides, HDL-C high-density lipoprotein cholesterol, LDL-C low-density lipoprotein cholesterol, Cre creatinine, BUN blood urea nitrogen. P < 0.05 was considered statistically significant. 3.2. Identification of differentially expressed lipids To investigate the role of lipids in the pathogenesis of pre-MetS and MetS, we performed subsequent analysis using the expression profiles of nontargeted lipidomics from the plasma of pre-MetS patients compared to healthy controls and MetS patients. Differential expression analysis of the 1,361 lipid expression profiles revealed that there were 77 significantly upregulated lipids in pre-MetS patients compared to healthy controls and 141 significantly upregulated lipids in pre-MetS patients compared to MetS patients. Additionally, there were 2 significantly downregulated lipids in pre-MetS patients compared to MetS patients ([62] Figures 2A, B ). VIP values were calculated for each metabolite through the OPLS-DA model, and metabolites with VIP values > 1.5 were considered the most important. The number of latent variables in the OPLS-DA model was chosen based on sevenfold cross-validation. OPLS-DA score plots demonstrated separation between pre-MetS patients and healthy controls, as well as between pre-MetS patients and MetS patients ([63] Figures 2C, D ). The cumulative R2Y values from the OPLS-DA model were 0.709 and 0.589, and the cumulative Q2 values were 0.453 and 0.342 for the pre-MetS vs. control and pre-MetS vs. MetS comparisons, respectively. From the 1,361 candidate metabolites, 50 and 89 metabolites were selected as candidates based on VIP > 1.5, FDR < 0.05, and log[2]|FC| > 1 ([64] Supplementary Tables 1 , [65]2 ). Figure 2. [66]Figure 2 [67]Open in a new tab Identification of lipids related to pre-MetS and MetS. (A) Volcano plot of candidate lipid metabolism biomarkers in the pre-MetS group. (B) Volcano plot of candidate lipid metabolism biomarkers in the MetS group. (C) Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DS) score plot between the pre-MetS and Normal groups. (D) OPLS-DS score plot between the MetS and pre-MetS groups. Lipid metabolites colored by their chemical categories. Multivariate analysis was conducted using a seven-fold cross-validation method. 3.3. Feature selection using LASSO, rf and SVM-RFE Three algorithms—LASSO, rf and SVM-RFE—were employed to select the core lipid features associated with pre-MetS patients. For SVM-RFE, to prevent overfitting, when including three features, PE(18:0/18:1), PS(38:3), and DG(16:0/18:1), the classifier accuracy reached a maximum value, and the error was minimized ([68] Figures 3A, B ). Using rf, 15 lipids were identified with relative importance >0.4, including: PE(18:0/18:1), PS(38:3), DG(36:2p), DG(33:1p), TG(18:1/18:2/22:2), DG(34:2p), DG(16:0/18:1), TG(16:0/10:1/18:2), TG(18:0/18:1/18:1), DG(34:1e), TG(16:0/16:0/23:0), DG(32:0p), DG(32:1p), and TG(18:0/18:0/18:1) ([69] Figures 3C, D ). Figure 3. [70]Figure 3 [71]Open in a new tab Pre-MetS lipid feature selection. (A, B) Biomarker signature lipid expression validation via SVM–RFE algorithm selection. (C) Random forest error rate versus the number of classification trees. (D) The top 16 relatively important lipids. (E) Adjustment of feature selection in the LASSO model. (F) Three algorithmic Venn diagrams screening lipids. All three algorithms employed ten-fold cross-validation for feature selection. Regarding the LASSO algorithm, after tenfold cross-validation, the optimal lambda (λ) was 0.02038657. Using a λ value of 0.045 that corresponded to the minimum partial likelihood deviance ([72] Figure 3E ), 11 feature lipids were selected: TG(18:1/18:2/22:2), PS(38:3), DG(16:0/18:1), TG(20:0/18:1/22:5), DG(36:1p), TG(16:0/16:0/16:0), DG(34:2p), TG(16:0/16:0/17:0), TG(25:0/18:1/18:1), DG(34:2p), and TG(16:0/18:1/20:3). Two lipids with shared features were identified from the LASSO, rf, and SVM-RFE algorithms: PS(38:3) and DG(16:0/18:1) ([73] Figure 3F and [74]Table 2 ). Table 2. The situation of two significantly different lipid metabolites identified by three machine learning methods in plasma between Normal and Pre-MetS. Molecule Subclass Formula m/z RT (min) VIP Log2|FC| P FDR PS (38:3)-H PS C44 H79 O10 N1 P1 812.54 12.23 2.11 1.20 1.79E-05 0.01 DG (16:0/18:1)+NH4 DG C37 H74 O5 N1 612.56 13.00 2.17 1.35 1.79E-05 0.01 [75]Open in a new tab The number before the ratio in parentheses is the length of the carbon chain, the number after the ratio is the number of double bonds on the carbon chain; -H and +NH4 are lipid molecule change groups. m/z: Mass-to-Charge Ratio; RT (min): Retention Time; VIP: Variable Importance in Projection; Log2|FC|: Log2 Fold Change; P: P-value; FDR: False Discovery Rate. 3.4. Feature selection using LASSO, rf and SVM-RFE for MetS patients The same three algorithms (LASSO, rf and SVM-RFE) were utilized to select the core lipid features associated with MetS patients. For SVM-RFE, when including 17 features, TG(52:5), TG(16:0/16:0/20:5), TG(16:0/14:0/20:5), TG(16:0/18:2/20:4), CerG1(d40:5), DG(32:0), TG(16:0/10:1/18:2), TG(15:0/16:1/17:0), TG(16:0/14:0/18:2), TG (48:3), TG (16:0/14:2/18:1), TG (50:3), TG (16:0/16:0/20:4), TG (16:1/18:2/18:3), TG (16:0/16:1/20:5), TG (16:0/14:1/22:6), and TG(16:1/18:2/20:4), the classifier accuracy reached a maximum value, and the error was minimized ([76] Figures 4A, B ). Using rf, 16 lipids were identified with relative importance >0.4, including: TG(16:0/16:0/20:4), TG(48:3), CerG1(d40:5), TG(16:0/18:2/20:4), TG (16:0/16:0/20:5), TG(16:0/14:2/18:1), TG(54:7), TG(16:0/14:1/22:6), TG (16:0/14:0/20:5), TG(16:0/14:0/18:2), TG(16:1/18:2/20:4), TG(52:5), TG (16:0/16:1/20:5), TG(14:0/18:2/18:3), TG(50:3), and TG(16:0/14:0/20:4) ([77] Figures 4C, D ). Regarding the LASSO algorithm, after tenfold cross-validation, the optimal lambda was 0.033. Using a λ value of 0.126 that corresponded to the minimum partial likelihood deviance ([78] Figure 4E ), 9 feature lipids were selected: TG (16:0/16:0/16:0), DG (18:2/20:4), TG (14:0/18:2/18:3), PI (16:0/16:1), TG (16:0/18:2/20:4), TG (18:1/18:2/22:5), TG (16:0/14:1/22:6), DG (32:0p), and DG (30:1p). Figure 4. [79]Figure 4 [80]Open in a new tab MetS lipid feature selection. (A, B) Biomarker signature lipid expression validation via SVM–RFE algorithm selection. (C) Random forest error rate versus the number of classification trees. (D) The top 17 relatively important lipids. (E) Adjustment of feature selection in the minimum absolute shrinkage and selection operator model (LASSO). (F) Three algorithmic Venn diagrams screening lipids. All three algorithms employed ten-fold cross-validation for feature selection. Three shared feature lipids were identified from the LASSO, rf and SVM-RFE algorithms: TG (16:0/14:1/22:6), TG (16:0/18:2/20:4), and TG (14:0/18:2/18:3) ([81] Figure 4F and [82]Table 3 ). Table 3. The situation of three significantly different lipid metabolites identified by three machine learning methods in plasma between Pre-MetS and MetS. Molecule Subclass Formula m/z RT (min) VIP Log2|FC| P FDR TG (16:0/14:1/22:6)+NH4 TG C55 H96 O6 N1 866.72 17.82 1.96 1.64 3.54E-08 6.88E-06 TG (16:0/18:2/20:4)+NH4 TG C57 H102 O6 N1 896.77 19.53 2.12 1.43 2.11E-08 6.88E-06 TG (14:0/18:2/18:3)+NH4 TG C53 H96 O6 N1 842.72 17.63 1.9 1.81 3.82E-07 4.00E-05 [83]Open in a new tab The number before the ratio in parentheses is the length of the carbon chain, the number after the ratio is the number of double bonds on the carbon chain; three sets of numbers indicate that the compound consists of three longer carbon chains; +NH4 are lipid molecule change groups. m/z: Mass-to-Charge Ratio; RT (min): Retention Time; VIP: Variable Importance in Projection; Log2|FC|: Log2 Fold Change; P: P-value; FDR: False Discovery Rate. 3.5. Machine learning models for pre-MetS and MetS identification An important application of lipidomics is the identification of potential disease biomarkers. Based on the feature selection results, PS(38:3) and DG(16:0/18:1) were identified as two important lipids for identifying pre-MetS ([84] Figure 5A ). We compared the performance of five popular machine learning algorithms on the test dataset to determine the optimal classification method for lipidomics data. These algorithms included glm, rpart, rf, lda, and pam. Due to the imbalanced sample sizes between the pre-MetS and control groups, we used balanced accuracy, F1-score, and AUC to evaluate the models. Among them, lda was identified as the best model with the highest balanced accuracy and F1-score, all exceeding 0.8 ([85] Figure 5B and [86]Figures 6A–E ). Figure 5. [87]Figure 5 [88]Open in a new tab (A) Determining lipid panels in pre-MetS based on three variable-selection methods. (B) Performance evaluation metrics for each ML-based model distinguish control individuals from pre-MetS patients. (C) Determining lipid panels in MetS based on three variable-selection methods. (D) Performance evaluation metrics for each ML-based model distinguishing pre-MetS from MetS. From left to right: glm, rpart, rf, lda, and pam. The repeated ten-fold cross-validation was used for model performance validation, while the ten-fold cross-validation was utilized for model training and parameter tuning. Figure 6. [89]Figure 6 [90]Open in a new tab Area under the receiver operating characteristic curves of five machine learning algorithms. (A–E) and (F–J) From left to right: generalized linear model (glm), recursive partitioning and regression (rpart), random forest (rf), linear discriminant analysis (lda), and prediction analysis for microarrays (pam). Based on the feature selection results, TG(16:0/14:1/22:6), TG(16:0/18:2/20:4), and TG(14:0/18:2/18:3) were identified as three important lipids for identifying MetS ([91] Figure 5C ). We used six performance metrics to evaluate the models, and rf demonstrated the best performance, with all metrics exceeding 0.8 ([92] Figure 5D and [93]Figures 6F–J ). 4. Discussion Metabolic risk factors present significant global challenges, necessitating effective strategies for early intervention. In this study, which involved a small sample of pre-MetS and MetS patients, we screened differential lipids between the two groups based on the expression levels of 1361 lipids and established identification models. Our results revealed significant differences in the levels of 77 lipids for pre-MetS compared to the control group and 143 lipids for MetS compared to the control group ([94] Figure 2 ). Furthermore, through machine learning, we selected the optimal lipid panel and models for identifying pre-MetS and MetS ([95] Figures 3 , [96]4 ), achieving model evaluation metrics exceeding 0.8 ([97] Figure 5 ). Previous studies have mainly focused on identifying metabolites associated with MetS ([98]16, [99]17). In contrast, our research emphasizes using machine learning-based lipid selection for identifying pre-MetS and MetS patients, particularly targeting middle-aged and elderly individuals at risk of metabolic dysfunction, and promoting effective interventions to modify risk factors, rather than relying solely on traditional risk factors. Our study differs from others in that we explored the differences in lipid metabolites between pre-MetS and MetS for the first time. Several explanations support this research. First, considering the complexity and heterogeneity of pre-MetS and MetS components ([100]19), a comprehensive assessment of lipid metabolism may better reflect the underlying disease progression, providing fundamental insights into the dynamic changes of MetS and enabling more specific treatments for patients. Second, considering the cumbersome nature of physical examinations during widespread screening and the potential for significant measurement errors and reduced efficiency due to variations in instruments, the diagnosis of pre-MetS and MetS may lead to false-positives. Therefore, lipid metabolites could serve as useful auxiliary indicators. In contrast to traditional classification, this study classified participants into three groups: control, pre-MetS, and MetS, aiming for a large-scale community-based screening program for MetS and cardiovascular disease prevention. In our research, the combinations of two and three biomarkers corresponded to LDA and rf models, respectively, with both exhibiting good discriminative ability in the validation set through sevenfold cross-validation (AUC of 0.89 for pre-MetS vs. control and 0.88 for pre-MetS vs. MetS) ([101] Figures 6D, H ). We found that higher levels of plasma DGs and TGs were positively correlated with the risk of pre-MetS and MetS. Consistent with previous studies ([102]16), we identified DG(36:2) as associated with MetS through OPLS-DA and univariate analysis ([103] Supplementary Table 1 ). Conversely, while previous research found that DG(34:1) was associated with MetS, we found it to be associated with pre-MetS. This is not surprising, as DGs act as bioactive lipids, serving as second messengers in insulin resistance induction, and TGs play a critical role in regulating fatty acid oxidation and lipid synthesis ([104]20), and are widely used to predict cardiovascular risk ([105]21). We identified a class of phospholipids (PE(18:0/18:1), PE(18:0/20:5), PS(38:3)) positively correlated with pre-MetS. Phosphatidylserine (PS) is involved in cell membrane composition and various signaling pathways, providing signals for immune cell recognition and phagocytosis during cell apoptosis ([106]22). Interestingly, immune-related dysregulation has been found to play a prominent role in pre-MetS ([107]12), which might be due to the biochemical pathways differing in the heterogeneity of pre-MetS populations in our study compared to other studies. We also found that levels of ceramides (Cer(d40:4), Cer(d40:5), Cer(d42:4)) were positively correlated with MetS. Total ceramide content is positively correlated with insulin resistance ([108]23). In fact, ceramides are involved in inducing cell apoptosis through various downstream targets ([109]24) and are associated with atherosclerosis ([110]25). Our study achieved favorable screening results with a relatively small number of lipids combined with corresponding models, yielding an AUC > 0.8. This indicated that the lipids we identified serve as excellent screening tools. However, the study has some limitations. Firstly, it is an exploratory study with a small sample size, which may lead to a certain degree of overfitting, although we mitigated this issue through various machine learning methods. Secondly, the LC-MS lipidomics technique can only differentiate lipids based on identification algorithms for subion, parent ion, and neutral loss scans, rather than providing clear and unique identification ([111]26). This complicates pathway enrichment analysis of different lipids in the study. Lastly, the participants in this study were all residents from coastal areas of China, and the results may not be extrapolated to other countries and inland regions. We hope that future research, combining larger sample sizes and multiomics studies, will further explore these findings. 5. Conclusion In this initial lipidomics analysis of pre-MetS and MetS, we identified relevant lipid features and selected 50 and 89 plasma lipid metabolites associated with pre-MetS and MetS patients, respectively. Furthermore, through machine learning, we selected two sets of plasma metabolites composed of PS(38:3), DG(16:0/18:1), and TG(16:0/14:1/22:6), TG(16:0/18:2/20:4), TG(14:0/18:2/18:3) as biomarkers for the identification models of pre-MetS and MetS in this study. Our results indicate that the identified biomarkers can reflect metabolic changes at different stages of MetS, providing a new perspective for monitoring disease progression and treatment response in pre-MetS and MetS patients. These findings hold promise for the differential diagnosis of pre-MetS and MetS, laying a foundation for future diagnostics and treatments. Data availability statement The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/[112] Supplementary Material . Ethics statement The studies involving humans were approved by ethical review committee of Fuzhou Center for Disease Control and Prevention. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article. Author contributions HH: Formal analysis, Validation, Writing – review & editing. XH: Formal analysis, Writing – original draft. HS: Investigation, Writing – review & editing. QH: Data curation, Writing - review & editing. XZ: Conceptualization, Methodology, Writing – review & editing, Funding acquisition, Resources. YX: Conceptualization, Methodology, Writing – review & editing. Acknowledgments