Graphical abstract

Highlights
• Protocol for processing metabolomics data in a prospective case-control study
• Steps of correlation, logistic regression, mediation, and variance partitioning analyses
• Access to source code with examples and explanations provided
__________________________________________________________________

Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
__________________________________________________________________

Liquid-chromatography-mass-spectrometry-based metabolomics is widely used in prospective case-control studies for disease prediction. Given the large amount of clinical and metabolomics data involved, data integration and analysis are crucial for an accurate understanding of the disease. We provide a comprehensive analysis approach for exploring associations among clinical risk factors, metabolites, and disease. We describe steps for performing Spearman correlation, conditional logistic regression, causal mediation, and variance partitioning analyses to investigate the potential effects of metabolites on disease.

Before you begin

With the advancement of analytical chemistry techniques, LC-MS (liquid chromatography-mass spectrometry)-based metabolomics is growing rapidly and has been widely used in precision medicine for biomarker discovery and mechanistic understanding of diseases.^2 Deep integration of metabolome profiling and clinical phenotypes in prospective cohort studies has the potential to reveal associations between clinical risk factors and metabolomic pathways during the development and progression of human diseases, which can provide strong support for individualized prevention and treatment.^1,3

This protocol presents analysis methods for processing metabolomics data in a prospective case-control study, as employed in a recent publication,^1 which allow us to reveal potential links between metabolite profiles and clinical phenotypes in disease. To explore the associations among clinical factors, metabolites, and disease, four parts of analysis were performed: 1) Spearman correlation to measure associations between metabolites and major clinical parameters; 2) conditional logistic regression analysis to quantify the association between metabolites and disease onset; 3) causal mediation analysis to evaluate the potential causal impact of metabolites on the disease; and 4) variance contribution analysis to explore the factors affecting metabolite levels. Together, these four layers of in-depth analysis give a comprehensive view of the clinical risk factor-metabolite-disease relationships, leading to a better understanding of the development and progression of human diseases. For each part of the analysis, R functions and example code are provided to help users apply these methods to other prospective studies with a nested case-control design. All input datasets, outcomes, and code are available on the public data repository Zenodo (https://zenodo.org/) at https://doi.org/10.5281/zenodo.7601709.

Input datasets specification

This protocol assumes the availability of three datasets for all participants in a prospective study: clinical characteristics data at baseline, metabolomics data at baseline, and disease condition data at follow-up.
Corresponding example datasets are provided on Zenodo at https://doi.org/10.5281/zenodo.7601709, as “clinic.txt” (Figure 1A), “metabolites.txt” (Figure 1B), and “outcome.txt” (Figure 1C), respectively. The first column of all three datasets should be unified as “ID”, the unique identifier of each participant in the study. “CP” in the “clinic.txt” file denotes clinical parameters (CPs), which may include sociodemographic characteristics (e.g., age, sex), lifestyle factors (e.g., cigarette smoking, alcohol drinking, physical activity, and dietary habits), obesity measurements (e.g., body mass index, waist circumference, hip circumference, waist-to-hip ratio), and other clinical risk factors of interest (e.g., blood pressure, glycemic measures, medical history) (Figure 1A). “M” in the “metabolites.txt” file denotes metabolites detected by LC-MS, such as amino acids, lipids, vitamins, and nucleotides (Figure 1B). In the “outcome.txt” file, the “case” column indicates disease status, yes (case = 1) or no (case = 0); the “match” column indicates the matched case-control ID, and a case and its corresponding controls should share the same matched ID (Figure 1C).

Figure 1. Snapshots of input datasets
(A) Clinical characteristics data (clinic.txt); CP represents clinical parameter.
(B) Metabolomic data (metabolites.txt); M represents metabolites.
(C) Disease information (outcome.txt); case indicates disease status, match is the match ID, and a case and its matched controls share the same match ID.

Computational requirements
Software: Linux (kernel version 5.4 or 5.10, e.g., Ubuntu 20.04 or above), Windows 10, or macOS (High Sierra or above). The R programming language (version >= 4.1) (https://www.r-project.org).

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Software and algorithms
R (v 4.2.2) | R Core Team | https://www.r-project.org
survival (v 3.4-0) | CRAN | https://cran.r-project.org/web/packages/survival/index.html
mediation (v 4.5.0) | CRAN | https://cran.r-project.org/web/packages/mediation/index.html
variancePartition (v 1.26.0) | Bioconductor 3.15 | http://bioconductor.org/packages/3.15/bioc/html/variancePartition.html
gplots (v 3.1.3) | CRAN | https://cran.r-project.org/web/packages/gplots/index.html
reshape2 (v 1.4.4) | CRAN | https://cran.rstudio.com/bin/windows/contrib/4.1/reshape2_1.4.4.zip
ggpattern (v 1.0.1) | CRAN | http://mirrors.pku.edu.cn/CRAN/web/packages/ggpattern/index.html
Deposited data
Example inputs | Example inputs.rar | https://doi.org/10.5281/zenodo.7601709
R functions | functions.R | https://doi.org/10.5281/zenodo.7601709
Codes | Statistical analysis.R | https://doi.org/10.5281/zenodo.7601709
Example outputs | Example outputs.rar | https://doi.org/10.5281/zenodo.7601709
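If the packages in the key resources table are not yet installed, they can be set up once before starting. The following is a minimal sketch (package names are taken from the table above; exact version pinning is not enforced here, and variancePartition is installed from Bioconductor via BiocManager):

> # Optional one-time setup: install the CRAN packages, then variancePartition
> # from Bioconductor via BiocManager.
> install.packages(c("survival","mediation","gplots","reshape2","ggpattern"))
> if(!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
> BiocManager::install("variancePartition")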
Step-by-step method details

Step 1: Datasets and codes import

Timing: 5 min

* 1. Import the three datasets prepared in “before you begin”.
  + a. Import “outcome.txt”.
> outcome.d<-read.table(file="outcome.txt",sep="\t",quote="",fill=T,header=T,row.names=1)
  + b. Import “metabolites.txt”.
> meta.d.1<-read.table(file="metabolites.txt",sep="\t",quote="",fill=T,header=T,row.names=1)
  + c. Import “clinic.txt”.
> clinic.d.1<-read.table(file="clinic.txt",sep="\t",quote="",fill=T,header=T,row.names=1)
* 2. Reorder the three datasets into the same row order by ID.
  + a. Log-transform and reorder the metabolomics data.
> meta.d<-log(meta.d.1[rownames(outcome.d),])
  + b. Reorder the clinical characteristics data.
> clinic.d<-clinic.d.1[rownames(outcome.d),]
* 3. Import the “functions.R” file.
> source("functions.R")

Note: “functions.R” contains the 9 functions required for data analysis: 1) “NT” for the data distribution test; 2) “SC” for Spearman correlation analysis; 3) “OR”, “split_dataset”, “random_dataset”, and “random_OR” for conditional logistic regression analysis and its validation; 4) “CM” for causal mediation analysis; and 5) “VP” and “VPplot” for variance partitioning analysis. The applications of these functions are detailed in the following step-by-step protocols. All the example input data needed to reproduce the code can be downloaded from https://doi.org/10.5281/zenodo.7601709.

Note: After importing the datasets and the functions.R file, the following five steps, i.e., the data distribution test (step 2) and the four parts of analysis (steps 3, 4, 5, and 6), can be applied separately.

Step 2: Data distribution test

Timing: 10 min

It is important to check the distribution of the research data before analysis to get an overview and comprehensive understanding of the metabolomic data and clinical parameters. The Shapiro-Wilk test is a formal hypothesis test commonly used for testing the normality of a data distribution; its statistic (W) is a measure of the linearity of the Q-Q plot (Figures 2A and 2B). Significant results (p < 0.05) reject the hypothesis that the population has a normal distribution and indicate that the sample comes from a non-normal population. However, formal hypothesis tests of assumptions become less useful as sample size increases. In that case, informal judgment of normality diagnostic graphs, including Q-Q plots and histograms, is recommended for visualizing and assessing the distribution of the data (Figures 2C and 2D).^4 The “NT” function facilitates normality testing: for a dataset with multiple variables, it automatically produces Shapiro-Wilk test results together with a histogram and Q-Q plot for each variable.

* 4. Normality test by the “NT” function.
  + a. Normality test for metabolomic data.
> NT(data=meta.d)
  + b. Normality test for clinical parameters.
> NT(data=clinic.d)

Note: The normality test gives the results of the Shapiro-Wilk test, as well as a histogram and Q-Q plot. The “normality.test.csv” file contains the Shapiro-Wilk test results, and the “normality.test.pdf” file contains the histogram and Q-Q plot of each variable (Figure 2).

Note: For the general characteristics of the clinical parameters of the study population, differences between cases and controls are commonly evaluated using Student’s t test for continuous variables with a normal distribution, the Wilcoxon signed-rank test for paired continuous variables with a skewed distribution, and the Chi-squared test for categorical variables.

Figure 2. Results of normality test
(A) Shapiro-Wilk test of metabolomic data (normality.test.csv).
(B) Shapiro-Wilk test of clinical data (normality.test.csv).
(C) Histogram and Q-Q plot of metabolomic data (normality.test.pdf).
(D) Histogram and Q-Q plot of clinical data (normality.test.pdf).
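For readers who want to see what such a helper involves, below is a minimal sketch of a normality-testing function in the spirit of “NT”. The shipped functions.R is the reference implementation; “NT_sketch” is a hypothetical name, and all columns are assumed to be numeric:

> # Minimal sketch, not the protocol's actual "NT": a Shapiro-Wilk test per
> # column, plus a histogram and Q-Q plot per variable (assumes numeric columns
> # with 3-5000 observations, as required by shapiro.test).
> NT_sketch<-function(data, prefix="normality.test"){
    res<-t(sapply(data, function(x){
      s<-shapiro.test(x)
      c(W=unname(s$statistic), p.value=s$p.value)
    }))
    write.csv(res, paste0(prefix,".csv"))
    pdf(paste0(prefix,".pdf"))
    for(v in colnames(data)){
      hist(data[[v]], main=v, xlab=v)
      qqnorm(data[[v]], main=v); qqline(data[[v]])
    }
    dev.off()
    invisible(res)
  }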
Step 3: Spearman correlation analysis

Timing: 10 min

Correlation coefficients are used to measure the association between two variables. Pearson correlation is commonly used in the context of a linear relationship between two normally distributed continuous variables, while Spearman correlation can be used to measure any monotonic relationship. Considering that 1) some clinical parameters (such as smoking status and diet score) are recorded as graded variables, and 2) metabolite profiles usually follow non-normal distributions, we used Spearman correlation analysis to assess the associations of individual metabolites with the baseline clinical parameters. We provide the “SC” function to calculate Spearman correlation coefficients.

* 5. Spearman correlation analysis by the “SC” function.
  + a. Calculate Spearman’s ρ and the p values of the Spearman correlation analysis.
> SCresult<-SC(data1=clinic.d, data2=meta.d)
> SCmatrix<-SCresult[[1]]
> pvalues<-SCresult[[2]]
  + b. Save Spearman’s ρ and the p values to the “SCmatrix.csv” (Figure 3A) and “SCpmatrix.csv” (Figure 3B) files, respectively.
> write.csv(SCmatrix,"SCmatrix.csv")
> write.csv(pvalues,"SCpmatrix.csv")
* 6. Visualize by heatmap.
> pdf("SC.pdf")
> heatmap.2(SCmatrix,col=bluered(75),key=TRUE, symkey=FALSE, trace="none", cexRow=0.8, margins=c(20,20))
> dev.off()

Note: Results of the Spearman correlation are visualized as a heatmap named “SC.pdf” (Figure 3C). We use hierarchical clustering (with complete linkage) to group metabolites, using absolute Spearman correlations as the measure of similarity.

Figure 3. Results of Spearman correlation analysis
(A) Results of Spearman’s ρ (SCmatrix.csv).
(B) Results of p values (SCpmatrix.csv).
(C) Heatmap of Spearman correlation analysis (SC.pdf).

Results interpretation
Relationships with a two-tailed p value < 0.05 are considered statistically significant. Appropriate corrections for multiple testing could be applied according to the study design.

CRITICAL: Pearson’s is the most popular correlation coefficient and works well when a linear relationship is suggested. However, when data are non-normally distributed, a test of the significance of the Pearson correlation coefficient may inflate type I error rates and reduce power. Normality checking (see the “data distribution test” section) is helpful for deciding whether Pearson correlation analysis is applicable. For skewed data, log transformation might help in some cases, but data transformations must be applied very cautiously. The Pearson correlation coefficient can be calculated by changing the method parameter of the “SC” function from “spearman” to “pearson”.
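As an illustration of what the pairwise calculation involves, here is a minimal sketch of a Spearman helper along the lines of “SC”. The protocol’s own “SC” function in functions.R is the reference implementation; “SC_sketch” is a hypothetical name:

> # Minimal sketch, not the protocol's actual "SC": pairwise Spearman correlations
> # between the columns of two data frames (exact=FALSE avoids warnings about
> # exact p values in the presence of ties).
> SC_sketch<-function(data1, data2){
    rho<-matrix(NA, ncol(data2), ncol(data1),
                dimnames=list(colnames(data2), colnames(data1)))
    p<-rho
    for(i in colnames(data2)) for(j in colnames(data1)){
      ct<-cor.test(data2[[i]], data1[[j]], method="spearman", exact=FALSE)
      rho[i,j]<-ct$estimate
      p[i,j]<-ct$p.value
    }
    list(rho, p)
  }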
Step 4: Logistic regression analysis

Timing: 30 min

Regression analysis is a way to quantify trends in data. Logistic regression uses the logit model to evaluate the association between a binary dependent (response) variable and a set of independent (explanatory) variables. However, in matched data with cases and matched controls, unconditional logistic regression is biased and overestimates the odds ratio (OR). Conditional logistic regression, which accounts for matching and stratification, is therefore often used in case-control studies. The associations between metabolites and disease risk were estimated using multivariable conditional logistic regression models, with adjustment for age, sex, and other clinical parameters. ORs (95% CI) are presented per SD increment of the metabolites. The R package “survival” is required, and the “OR” function is provided for OR calculation.

* 7. Define the formula for the logit model.
> form1<-case~meta+age+as.factor(sex)+CP8+as.factor(CP9)+as.factor(CP10)+CP13+as.factor(CP11)+as.factor(CP12)+CP14+CP1+CP2+CP3+CP6+strata(match)

CRITICAL: The example formula above is adjusted for age, sex, and other clinical parameters (CPs). Users should define their own formula according to their variables.

* 8. Logistic regression analysis by the “OR” function.
  + a. Calculate ORs.
> ORresult<-OR(form=form1, meta.d=meta.d, clinic.d=clinic.d, outcome.d=outcome.d)
  + b. Save the results to the “OR.csv” file (Figure 4).
> write.csv(ORresult,"OR.csv")

Note: To validate the reliability of the identified metabolites associated with disease onset, internal validation tests and random sampling are usually performed as sensitivity analyses. In our study, internal validation tests were performed in two randomly split subsets. Moreover, 200 tests were conducted by randomly selecting 80% of all samples as a subset each time. The “split_dataset” function splits the dataset into 2 sub-datasets in a customized proportion; the “random_dataset” function generates random sampling datasets; and the “random_OR” function performs conditional logistic analysis on the random sampling datasets. Figure 4 shows the workflow of conditional logistic regression and validation.

* 9. Validate ORs in two randomly split subsets.
  + a. Generate two randomly split subsets as dataset1 and dataset2.
> split.d<-split_dataset(n.seed=1, proportion.subdataset1 = 0.5)

Note: We use 50% of participants in each subset as an example (proportion.subdataset1 = 0.5). Users can define the size of the sub-datasets by setting the parameter “proportion.subdataset1”.

  + b. Calculate ORs for dataset1 and dataset2.
> t_ORresult<-OR(form=form1,meta.d=split.d[[1]],clinic.d=split.d[[2]],outcome.d=split.d[[3]])
> v_ORresult<-OR(form=form1,meta.d=split.d[[4]],clinic.d=split.d[[5]],outcome.d=split.d[[6]])
  + c. Export the results of the 2 subsets to “dataset1_OR.csv” and “dataset2_OR.csv” (Figure 4).
> write.csv(t_ORresult,"dataset1_OR.csv")
> write.csv(v_ORresult,"dataset2_OR.csv")
* 10. Validate ORs in random sampling datasets.
  + a. Calculate and summarize the ORs of the random sampling datasets.
> random_ORresult<-random_OR(times.random=200, percentage=0.8)

Note: We use 200 tests on 80% randomly selected samples as an example (times.random = 200, percentage = 0.8). Users can define the number of samplings and the size of each sampling by setting the parameters “times.random” and “percentage”.

  + b. Save the results to the “randomdataset_OR.csv” file.
> write.csv(random_ORresult,"randomdataset_OR.csv")

Note: In the “randomdataset_OR.csv” file, the number of significant ORs (both p < 0.05 and FDR < 0.05) is counted to assess the repeatability of the logistic analysis (Figure 4).

Figure 4. Workflow and results of conditional logistic analysis
Red faces represent cases in the study and black faces the matched controls. ORs were calculated in the full dataset and validated in both sub-datasets and random sampling datasets (OR.csv). Sub-datasets were created by splitting the dataset into two independent subsets (dataset1_OR.csv and dataset2_OR.csv). Random sampling datasets were generated by randomly selecting samples of a given size from the full dataset multiple times (randomdataset_OR.csv).

Results interpretation
p values were adjusted for the false discovery rate (FDR) using the Benjamini-Hochberg method. ORs with FDR < 0.05 are considered significant.
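To make the underlying model concrete, below is a minimal sketch of the conditional logistic fit behind the “OR” function, using survival::clogit for a single metabolite. This is not the authors’ “OR” implementation; “M1” is a placeholder column name, age and sex are assumed to be columns of clinic.d, and the metabolite is SD-scaled so the OR is per SD increment:

> # Minimal sketch: conditional logistic regression for one metabolite,
> # stratified on the case-control matching variable.
> library(survival)
> d<-data.frame(case=outcome.d$case, match=outcome.d$match,
               meta=as.numeric(scale(meta.d[,"M1"])), clinic.d)
> fit<-clogit(case~meta+age+as.factor(sex)+strata(match), data=d)
> summary(fit)$conf.int["meta",]   # OR per SD with its 95% CI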
Step 5: Causal mediation analysis

Timing: 9 h

The preceding Spearman correlation analysis (step 3) and logistic regression analysis (step 4) identify significant metabolite-disease and metabolite-clinical risk factor associations. However, it remains unclear which clinical risk factors could modify the association between metabolite levels and disease risk. Causal mediation analysis can provide a causal interpretation of the effect of metabolites on disease onset. To calculate the total, direct, and indirect effects of an exposure on an outcome, causal mediation analysis compares two regression models, one conditioned on the mediator and one not (Figure 5A). The following 5 steps are included: 1) examine the association between exposure and mediator (path exposure to mediator, estimate a in Figure 5A); 2) examine the association of the mediator, as an independent variable, with the outcome (path mediator to outcome, estimate b in Figure 5A); 3) assess the simple total effect of exposure on outcome (path exposure to outcome, estimate c in Figure 5A); 4) assess the direct effect of exposure on outcome with the mediator controlled (path exposure to outcome not through the mediator, estimate c’ in Figure 5A); and 5) calculate the indirect effect of exposure on outcome through the mediator (path exposure to outcome through the mediator, estimate a×b in Figure 5A).^5 Note that total effect = direct effect + indirect effect; in other words, c = c’ + ab.

Figure 5. Causal mediation analysis
(A) Mediation model.
(B) Results of causal mediation analysis (causal mediation.csv).

The R package “mediation” and the “CM” function we provide are required for causal mediation analysis.

* 11. Define the formulas for causal mediation analysis.

Note: Three models are required for causal mediation analysis: Y = cX + e1, M = aX + e2, and Y = c’X + bM + e3, where Y is the outcome, X the exposure, M the mediator, c the total effect of the metabolites, c’ the direct effect of the metabolites, and ab the indirect effect of the metabolites mediated by the risk factor. Here we give example formulas for performing causal mediation analysis on the path metabolites (exposure)-clinical risk factors (mediator)-disease (outcome). This helps us to better understand the potential biological roles of metabolites by assessing their indirect effects on disease operating through clinical risk factors. Users should define their own formulas based on their hypothesized exposure, mediator, and outcome.

  + a. Define the formula M = aX + e2.
> form.a<-clinic~meta
  + b. Define the formula Y = c'X + bM + e3.
> form.b<-case~meta+clinic
  + c. Define the formula Y = cX + e1.
> form.c<-case~meta
* 12. Perform causal mediation analysis by the “CM” function.
> CMresult<-CM(form.a, form.b, form.c)
* 13. Write the results to the “causal mediation.csv” file.
> write.csv(CMresult,"causal mediation.csv", row.names=F)

Results interpretation
The following criteria need to be satisfied for a variable to be considered a mediator (Figure 5B): 1) the exposure variable significantly affects the mediator (beta and p values of estimate a in the “causal mediation.csv” file); 2) there is a significant relationship between the mediator and the outcome (beta and p values of estimate b in the “causal mediation.csv” file); and 3) the exposure variable significantly affects the outcome (beta and p values of estimate c in the “causal mediation.csv” file). The direct effect (ADE) and its corresponding p values are also given in the “causal mediation.csv” file. ACME (average causal mediation effect) represents the indirect effect, and P[acme] < 0.05 is defined as a significant causal mediation relationship. The mediation proportion (prop in the “causal mediation.csv” file) is used to assess how much of the effect of metabolites on disease is mediated through the clinical measurements. Generally, a mediated proportion >10% is recognized as a substantial mediation effect.
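For orientation, here is a minimal sketch of the model pair behind the “CM” function, using mediation::mediate on one exposure-mediator-outcome triplet. This is not the authors’ “CM” implementation; “M1” and “CP1” are placeholder column names:

> # Minimal sketch: fit the mediator model (M = aX + e2) and the outcome model
> # (Y = c'X + bM + e3), then let mediate() combine them.
> library(mediation)
> d<-data.frame(case=outcome.d$case, meta=meta.d[,"M1"], clinic=clinic.d[,"CP1"])
> med.fit<-lm(clinic~meta, data=d)                          # M = aX + e2
> out.fit<-glm(case~meta+clinic, family=binomial, data=d)   # Y = c'X + bM + e3
> med.res<-mediate(med.fit, out.fit, treat="meta", mediator="clinic", sims=1000)
> summary(med.res)   # reports ACME, ADE, total effect, and proportion mediated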
Step 6: Variance partitioning analysis

Timing: 10 min

Variance partitioning is a statistical method to quantify the contribution of multiple sources of variation and to decouple within- and between-individual variation. Here we assessed the variance in metabolite levels explained by each clinical measurement at baseline using variance partitioning. The R package “variancePartition” and the “VP” function are required for the variance partitioning calculation.

* 14. Define the formula for variance partitioning analysis.
> form1<-meta~age+sex+CP8+CP9+CP10+CP13+CP11+CP12+CP14+CP1+CP2+CP3+CP6

Note: All clinical variables that might relate to metabolite levels are included here. Users should define their own formula according to the clinical variables of interest.

* 15. Perform variance partitioning analysis by the “VP” function.
  + a. Calculate the variance contribution and correlation type.
> VPresult<-VP(form1)
> vars<-VPresult[[1]]
> corr_type<-VPresult[[2]]
  + b. Save the results of the variance contribution and correlation type to “VP_var.csv” (Figure 6B) and “VP_corrtype.csv” (Figure 6C).
> write.csv(vars,"VP_var.csv")
> write.csv(corr_type,"VP_corrtype.csv")
* 16. Find the biggest variance.
  + a. For each metabolite, extract the variable that explains the most of the total variation.
> name_max=array(NA,c(nrow(vars),1))
> var_max=array(NA,c(nrow(vars),1))
> corr_type_max=array(NA,c(nrow(vars),1))
> for(i in 1:nrow(vars)){
    k<-which.max(vars[i,])
    name_max[i,1]=colnames(vars)[k]
    var_max[i,1]=vars[i,k]
    corr_type_max[i,1]=corr_type[i,k]
  }
> rownames(name_max)<-rownames(var_max)<-rownames(corr_type_max)<-rownames(vars)
> colnames(name_max)="name"
> colnames(var_max)="var"
> colnames(corr_type_max)="corr_type"
  + b. Visualize the maximum variance contribution for each metabolite as a bar plot (Figure 6A).
> pdf("VPplot.pdf")
> VPplot(var=var_max,corr_type=corr_type_max)
> dev.off()

Figure 6. Results of variance partitioning analysis
(A) Bar plot of the variance that explains the most of the total variation of metabolite levels (VPplot.pdf).
(B) Percentage of variance contribution for metabolites (VP_var.csv).
(C) Sign of beta in linear regression of clinical parameters on metabolites (VP_corrtype.csv).

Results interpretation
A variance explaining over 2% of the total variation is recognized as prominent. If a metabolite’s variance can rarely be explained by traditional clinical measures, that metabolite might be meaningful as a new prediction factor, since it can hardly be replaced by traditional clinical measures.
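As background, here is a minimal sketch of the variancePartition call that a wrapper like “VP” might build on. This is not the authors’ “VP” implementation; note that variancePartition expects features (here, metabolites) in rows and samples in columns, plus a right-hand-side-only formula whose covariates come from clinic.d (the covariates shown are examples):

> # Minimal sketch: fraction of each metabolite's variance explained by
> # selected clinical covariates.
> library(variancePartition)
> vp<-fitExtractVarPartModel(t(meta.d), ~age+CP1+CP2, clinic.d)
> plotVarPart(vp)   # violin plot of the variance fraction explained per covariate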
Expected outcomes

13 files, comprising 10 .csv files and 3 plots, are expected from the above 6 steps; details are listed below. All example files are available on Zenodo at https://doi.org/10.5281/zenodo.7601709.

Analysis | Outcomes | Example file name
Data distribution test | Results of the Shapiro-Wilk test, histogram, and Q-Q plot for the data normality test | normality.test.csv (Figures 2A and 2B); normality.test.pdf (Figures 2C and 2D)
Spearman correlation analysis | Spearman correlation coefficients between clinical parameters and metabolites | SCmatrix.csv (Figure 3A); SCpmatrix.csv (Figure 3B)
 | Visualization | SC.pdf (Figure 3C)
Conditional logistic regression analysis | ORs (95% CI) of disease risk per SD increment for each metabolite | OR.csv (Figure 4)
 | OR validation in 2 sub-datasets | dataset1_OR.csv (Figure 4); dataset2_OR.csv (Figure 4)
 | OR validation in random sampling datasets | randomdataset_OR.csv (Figure 4)
Causal mediation analysis | Causal mediation analysis on metabolites-clinical risk factors-disease | causal mediation.csv (Figure 5B)
Variance partitioning analysis | Variance contribution of clinical risk factors to metabolites | VP_var.csv (Figure 6B); VP_corrtype.csv (Figure 6C)
 | Visualization | VPplot.pdf (Figure 6A)

Limitations

This protocol provides a useful strategy for processing LC-MS-based metabolomics data from prospective nested case-control studies in R. The aim of this protocol is to deeply integrate metabolome profiling and clinical phenotypes and to explore the relationship between clinical risk factors and metabolomic pathways in the development and progression of human diseases. In addition to the general analysis strategies, we provide customization parameters wherever possible to increase the adaptability of this protocol. However, there are still some limitations.

First, metabolic pathway enrichment analysis is a popular method for metabolomic profiling data mining and visualization, and it is helpful for exploring the critical pathways and biological significance of diseases. In our study, 50 target metabolites in particular pathways (amino acids and microbiota-related metabolites) were detected,^1 so it is unnecessary and improper to perform pathway enrichment in this case, and we did not provide a pathway enrichment method in this paper. However, pathway enrichment analysis is highly recommended for researchers performing untargeted metabolomic detection, which has no selection bias of metabolites. MetaboAnalyst (www.metaboanalyst.ca) is a free web-based tool that provides approaches to metabolic pathway analysis and enrichment analysis.^6

Second, in logistic regression analysis, a validation study is an important way to understand and mitigate information bias.^7 In step 4, our internal validation was conducted by sampling at random. However, many other kinds of validation methods could be considered, and users may want to design their validation study differently, for example, sampling based on a true gold-standard measure or stratifying by specific variables. The code available here performs conditional logistic regression only on validation sets sampled at random. For generality, we provide the “OR” function for customized datasets, with which users can conduct logistic regression analysis on validation data of their own.
Third, in causal mediation analysis, bidirectional mediation analysis in both directions (direction 1: clinical risk factors-metabolites-disease; direction 2: clinical risk factors-disease-metabolites) could provide more robust evidence than one-directional causal mediation analysis to support the hypothesis that metabolites causally impact the onset of disease, or vice versa. However, bidirectional mediation analysis in direction 2 demands metabolomics data from a follow-up visit after disease occurs. Because only baseline metabolites were measured in our prospective study, we only performed causal mediation analysis in direction 1. Users who need to conduct bidirectional mediation analysis can modify the formulas according to their study design.

Finally, variancePartition divides the contribution over multiple variables that are strongly correlated; therefore, including more correlated variables in the model will reduce the variance attributed to individual factors. It is useful to include all variables in a first analysis and then drop variables that have minimal effect. Considering that a personalized selection of variables according to the individual research focus may be more appropriate, we did not provide a workflow for variable dropping in this paper.

Troubleshooting

Problem 1
Package installation failures might appear when importing the “functions.R” file in step 1 (datasets and codes import), as follows: 1) “package is not available” and 2) “failed to lock directory”.

Potential solution
* For error 1, “package is not available”: R packages are primarily distributed as source packages, but binary packages (a packaging up of the installed package) are also supported. Try to install the binary package instead:
> install.packages(pkgs=package, type="binary")
* Error 2, “failed to lock directory”, arises because the library directory is ‘locked’, which prevents any other process from installing into that library and preserves the previous version of the package. The solution is to remove the lock directory of this package manually.

Problem 2
The error “cannot open file: No such file or directory” occurs when the required files (“functions.R” or the input datasets) cannot be found in the R working directory in step 1 (datasets and codes import).

Potential solution
* Make sure that the following files are in the same folder: “functions.R”, “metabolites.txt”, “clinic.txt”, and “outcome.txt”. File paths in Chinese or other languages may not be recognized by RStudio, so use an English-language path when running RStudio. Set the R working directory to the correct location by running the following command in R.
> setwd(dir="target directory")
* If the source code and example files were downloaded from Zenodo, please unzip the “example inputs” file and extract all 3 example input files to the same location as “functions.R”.

Problem 3
A warning of conversion failure might appear when generating a pdf file containing a plot in step 3 (Spearman correlation analysis, SC.pdf) and step 6 (variance partitioning analysis, VPplot.pdf).

Potential solution
The RStudio IDE uses UTF-8 as its default encoding. If a plot contains Greek characters such as α or β, a conversion failure occurs. The solution is to change the encoding type of the pdf file to “ISOLatin2.enc” by running the command below.
> pdf(file=filename, encoding="ISOLatin2.enc")

Problem 4
There may be no response from the conditional logistic regression analysis in step 4 (logistic regression analysis).
The “OR” function performs only conditional logistic regression, which requires case-control matched ID information from a case-control study. If no matching information is provided, no result will be generated when calling the “OR” function.

Potential solution
Interrupt R to stop the running “OR” function and check the matched IDs in 2 steps. First, make sure the case-control matched IDs are included in the “match” column of the “outcome.txt” file. Second, make sure the matched IDs are assigned in the model for the OR calculation, as in “case ~ meta + strata(match)”.

Problem 5
In step 4 (logistic regression analysis), a zero is produced in the OR results (OR.csv) if any error occurs in the OR calculation, by the design of our “OR” function.

Potential solution
Missing values in the metabolite data will throw an error; therefore, missing metabolite values need to be imputed. Metabolites with ≥20% missingness should be excluded, and imputation with 50% of the lowest measured value or the limit of detection is recommended.

Resource availability

Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Jieli Lu (ljl11319@rjh.com.cn).

Materials availability
This study did not generate new unique reagents.

Acknowledgments