Graphical abstract

Highlights
• Protocol for processing metabolomics data in a prospective case-control study
• Steps of correlation, logistic regression, mediation, and variance partitioning analyses
• Access to source code with examples and explanations provided
__________________________________________________________________

Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
__________________________________________________________________

Liquid-chromatography-mass-spectrometry-based metabolomics is widely used in prospective case-control studies for disease prediction. Given the large amount of clinical and metabolomics data involved, data integration and analysis are crucial for an accurate understanding of the disease. We provide a comprehensive analysis approach for exploring associations among clinical risk factors, metabolites, and disease. We describe steps for performing Spearman correlation, conditional logistic regression, causal mediation, and variance partitioning analyses to investigate the potential effects of metabolites on disease.

Before you begin

With the advancement of analytical chemistry techniques, LC-MS (liquid chromatography-mass spectrometry)-based metabolomics is growing rapidly and has been widely used in precision medicine for biomarker discovery and mechanistic understanding of diseases.^2 Deep integration of metabolome profiling and clinical phenotypes in prospective cohort studies has the potential to reveal associations between clinical risk factors and metabolomic pathways during the development and progression of human diseases, which can provide strong support for individualized prevention and treatment.^1,3

This protocol presents analysis methods for processing metabolomics data in a prospective case-control study, as employed in a recent publication,^1 which allow us to reveal potential links between metabolite profiles and clinical phenotypes in disease. To explore the associations among clinical factors, metabolites, and disease, four parts of analysis were performed: 1) Spearman correlation to measure associations between metabolites and major clinical parameters; 2) conditional logistic regression analysis to quantify the association between metabolites and disease onset; 3) causal mediation analysis to evaluate the potential causal impact of metabolites on the disease; and 4) variance contribution analysis to explore the factors affecting metabolite levels. Together, these four layers of in-depth analysis give a comprehensive view of the clinical risk factor-metabolite-disease relationships, leading to a better understanding of the development and progression of human diseases. For each part of the analysis, R functions and example code are provided to help users apply these methods to other prospective studies with a nested case-control design. All input datasets, outcomes, and code are available on the public data repository Zenodo (https://zenodo.org/) at https://doi.org/10.5281/zenodo.7601709.

Input datasets specification

This protocol assumes the availability of three datasets for all participants in a prospective study: clinical characteristics data at baseline, metabolomics data at baseline, and disease condition data at follow-up.
Corresponding example datasets are provided on Zenodo at https://doi.org/10.5281/zenodo.7601709, as “clinic.txt” (Figure 1A), “metabolites.txt” (Figure 1B), and “outcome.txt” (Figure 1C), respectively. The first column of all three datasets should be unified as “ID”, the unique identifier of each participant in the study. “CP” in the “clinic.txt” file denotes clinical parameters (CPs), which may include sociodemographic characteristics (e.g., age, sex), lifestyle factors (e.g., cigarette smoking, alcohol drinking, physical activity, and dietary habits), obesity measurements (e.g., body mass index, waist circumference, hip circumference, waist-to-hip ratio), and other clinical risk factors of interest (e.g., blood pressure, glycemic measures, medical history) (Figure 1A). “M” in the “metabolites.txt” file denotes metabolites detected by LC-MS, such as amino acids, lipids, vitamins, and nucleotides (Figure 1B). In the “outcome.txt” file, the “case” column indicates disease status, yes (case = 1) or no (case = 0); the “match” column indicates the matched case-control ID, and a case and its corresponding controls should share the same matched ID (Figure 1C).

Figure 1. Snapshots of input datasets
(A) Clinical characteristics data (clinic.txt); CP represents clinical parameter.
(B) Metabolomic data (metabolites.txt); M represents metabolites.
(C) Disease information (outcome.txt); case indicates disease status, match is the match ID, and a case and its matched controls share the same match ID.

Computational requirements
Software: Linux (kernel version 5.4 or 5.10, e.g., Ubuntu 20.04 or above), Windows 10, or macOS (High Sierra or above). The R programming language (version >= 4.1) (https://www.r-project.org).

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Software and algorithms
R (v 4.2.2) | R Core Team | https://www.r-project.org
survival (v 3.4-0) | CRAN | https://cran.r-project.org/web/packages/survival/index.html
mediation (v 4.5.0) | CRAN | https://cran.r-project.org/web/packages/mediation/index.html
variancePartition (v 1.26.0) | Bioconductor 3.15 | http://bioconductor.org/packages/3.15/bioc/html/variancePartition.html
gplots (v 3.1.3) | CRAN | https://cran.r-project.org/web/packages/gplots/index.html
reshape2 (v 1.4.4) | CRAN | https://cran.rstudio.com/bin/windows/contrib/4.1/reshape2_1.4.4.zip
ggpattern (v 1.0.1) | CRAN | http://mirrors.pku.edu.cn/CRAN/web/packages/ggpattern/index.html
Deposited data
Example inputs | Example inputs.rar | https://doi.org/10.5281/zenodo.7601709
R functions | functions.R | https://doi.org/10.5281/zenodo.7601709
Codes | Statistical analysis.R | https://doi.org/10.5281/zenodo.7601709
Example outputs | Example outputs.rar | https://doi.org/10.5281/zenodo.7601709
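If the packages in the key resources table are not yet installed, they can be set up once before starting. The following is a minimal sketch (package names are taken from the table above; exact version pinning is not enforced here, and variancePartition is installed from Bioconductor via BiocManager):

> # Optional one-time setup: install the CRAN packages, then variancePartition
> # from Bioconductor via BiocManager.
> install.packages(c("survival","mediation","gplots","reshape2","ggpattern"))
> if(!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
> BiocManager::install("variancePartition")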
Step-by-step method details

Step 1: Datasets and codes import

Timing: 5 min

* 1. Import the three datasets prepared in “before you begin”.
  + a. Import “outcome.txt”.
> outcome.d<-read.table(file="outcome.txt",sep="\t",quote="",fill=T,header=T,row.names=1)
  + b. Import “metabolites.txt”.
> meta.d.1<-read.table(file="metabolites.txt",sep="\t",quote="",fill=T,header=T,row.names=1)
  + c. Import “clinic.txt”.
> clinic.d.1<-read.table(file="clinic.txt",sep="\t",quote="",fill=T,header=T,row.names=1)
* 2. Reorder the three datasets into the same row order by ID.
  + a. Log-transform and reorder the metabolomics data.
> meta.d<-log(meta.d.1[rownames(outcome.d),])
  + b. Reorder the clinical characteristics data.
> clinic.d<-clinic.d.1[rownames(outcome.d),]
* 3. Import the “functions.R” file.
> source("functions.R")

Note: “functions.R” contains the 9 functions required for data analysis: 1) “NT” for the data distribution test; 2) “SC” for Spearman correlation analysis; 3) “OR”, “split_dataset”, “random_dataset”, and “random_OR” for conditional logistic regression analysis and its validation; 4) “CM” for causal mediation analysis; and 5) “VP” and “VPplot” for variance partitioning analysis. The applications of these functions are detailed in the following step-by-step protocols. All the example input data needed to reproduce the code can be downloaded from https://doi.org/10.5281/zenodo.7601709.

Note: After importing the datasets and the functions.R file, the following five steps, i.e., the data distribution test (step 2) and the four parts of analysis (steps 3, 4, 5, and 6), can be applied separately.

Step 2: Data distribution test

Timing: 10 min

It is important to check the distribution of the research data before analysis to get an overview and comprehensive understanding of the metabolomic data and clinical parameters. The Shapiro-Wilk test is a formal hypothesis test commonly used for testing the normality of a data distribution; its statistic (W) is a measure of the linearity of the Q-Q plot (Figures 2A and 2B). Significant results (p < 0.05) reject the hypothesis that the population has a normal distribution and indicate that the sample comes from a non-normal population. However, formal hypothesis tests of assumptions become less useful as sample size increases. In that case, informal judgment of normality diagnostic graphs, including Q-Q plots and histograms, is recommended for visualizing and assessing the distribution of the data (Figures 2C and 2D).^4 The “NT” function facilitates normality testing: for a dataset with multiple variables, it automatically produces Shapiro-Wilk test results together with a histogram and Q-Q plot for each variable.

* 4. Normality test by the “NT” function.
  + a. Normality test for metabolomic data.
> NT(data=meta.d)
  + b. Normality test for clinical parameters.
> NT(data=clinic.d)

Note: The normality test gives the results of the Shapiro-Wilk test, as well as a histogram and Q-Q plot. The “normality.test.csv” file contains the Shapiro-Wilk test results, and the “normality.test.pdf” file contains the histogram and Q-Q plot of each variable (Figure 2).

Note: For the general characteristics of the clinical parameters of the study population, differences between cases and controls are commonly evaluated using Student’s t test for continuous variables with a normal distribution, the Wilcoxon signed-rank test for paired continuous variables with a skewed distribution, and the Chi-squared test for categorical variables.

Figure 2. Results of normality test
(A) Shapiro-Wilk test of metabolomic data (normality.test.csv).
(B) Shapiro-Wilk test of clinical data (normality.test.csv).
(C) Histogram and Q-Q plot of metabolomic data (normality.test.pdf).
(D) Histogram and Q-Q plot of clinical data (normality.test.pdf).
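For readers who want to see what such a helper involves, below is a minimal sketch of a normality-testing function in the spirit of “NT”. The shipped functions.R is the reference implementation; “NT_sketch” is a hypothetical name, and all columns are assumed to be numeric:

> # Minimal sketch, not the protocol's actual "NT": a Shapiro-Wilk test per
> # column, plus a histogram and Q-Q plot per variable (assumes numeric columns
> # with 3-5000 observations, as required by shapiro.test).
> NT_sketch<-function(data, prefix="normality.test"){
    res<-t(sapply(data, function(x){
      s<-shapiro.test(x)
      c(W=unname(s$statistic), p.value=s$p.value)
    }))
    write.csv(res, paste0(prefix,".csv"))
    pdf(paste0(prefix,".pdf"))
    for(v in colnames(data)){
      hist(data[[v]], main=v, xlab=v)
      qqnorm(data[[v]], main=v); qqline(data[[v]])
    }
    dev.off()
    invisible(res)
  }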
Step 3: Spearman correlation analysis

Timing: 10 min

Correlation coefficients are used to measure the association between two variables. Pearson correlation is commonly used in the context of a linear relationship between two normally distributed continuous variables, while Spearman correlation can be used to measure any monotonic relationship. Considering that 1) some clinical parameters (such as smoking status and diet score) are recorded as graded variables, and 2) metabolite profiles usually follow non-normal distributions, we used Spearman correlation analysis to assess the associations of individual metabolites with the baseline clinical parameters. We provide the “SC” function to calculate Spearman correlation coefficients.

* 5. Spearman correlation analysis by the “SC” function.
  + a. Calculate Spearman’s ρ and the p values of the Spearman correlation analysis.
> SCresult<-SC(data1=clinic.d, data2=meta.d)
> SCmatrix<-SCresult[[1]]
> pvalues<-SCresult[[2]]
  + b. Save Spearman’s ρ and the p values to the “SCmatrix.csv” (Figure 3A) and “SCpmatrix.csv” (Figure 3B) files, respectively.
> write.csv(SCmatrix,"SCmatrix.csv")
> write.csv(pvalues,"SCpmatrix.csv")
* 6. Visualize by heatmap.
> pdf("SC.pdf")
> heatmap.2(SCmatrix,col=bluered(75),key=TRUE, symkey=FALSE, trace="none", cexRow=0.8, margins=c(20,20))
> dev.off()

Note: Results of the Spearman correlation are visualized as a heatmap named “SC.pdf” (Figure 3C). We use hierarchical clustering (with complete linkage) to group metabolites, using absolute Spearman correlations as the measure of similarity.

Figure 3. Results of Spearman correlation analysis
(A) Results of Spearman’s ρ (SCmatrix.csv).
(B) Results of p values (SCpmatrix.csv).
(C) Heatmap of Spearman correlation analysis (SC.pdf).

Results interpretation
Relationships with a two-tailed p value < 0.05 are considered statistically significant. Appropriate corrections for multiple testing could be applied according to the study design.

CRITICAL: Pearson’s is the most popular correlation coefficient and works well when a linear relationship is suggested. However, when data are non-normally distributed, a test of the significance of the Pearson correlation coefficient may inflate type I error rates and reduce power. Normality checking (see the “data distribution test” section) is helpful for deciding whether Pearson correlation analysis is applicable. For skewed data, log transformation might help in some cases, but data transformations must be applied very cautiously. The Pearson correlation coefficient can be calculated by changing the method parameter of the “SC” function from “spearman” to “pearson”.
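As an illustration of what the pairwise calculation involves, here is a minimal sketch of a Spearman helper along the lines of “SC”. The protocol’s own “SC” function in functions.R is the reference implementation; “SC_sketch” is a hypothetical name:

> # Minimal sketch, not the protocol's actual "SC": pairwise Spearman correlations
> # between the columns of two data frames (exact=FALSE avoids warnings about
> # exact p values in the presence of ties).
> SC_sketch<-function(data1, data2){
    rho<-matrix(NA, ncol(data2), ncol(data1),
                dimnames=list(colnames(data2), colnames(data1)))
    p<-rho
    for(i in colnames(data2)) for(j in colnames(data1)){
      ct<-cor.test(data2[[i]], data1[[j]], method="spearman", exact=FALSE)
      rho[i,j]<-ct$estimate
      p[i,j]<-ct$p.value
    }
    list(rho, p)
  }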
Step 4: Logistic regression analysis

Timing: 30 min

Regression analysis is a way to quantify trends in data. Logistic regression uses the logit model to evaluate the association between a binary dependent (response) variable and a set of independent (explanatory) variables. However, in matched data with cases and matched controls, unconditional logistic regression is biased and overestimates the odds ratio (OR). Conditional logistic regression, which accounts for matching and stratification, is therefore often used in case-control studies. The associations between metabolites and disease risk were estimated using multivariable conditional logistic regression models, with adjustment for age, sex, and other clinical parameters. ORs (95% CI) are presented per SD increment of the metabolites. The R package “survival” is required, and the “OR” function is provided for OR calculation.

* 7. Define the formula for the logit model.
> form1<-case~meta+age+as.factor(sex)+CP8+as.factor(CP9)+as.factor(CP10)+CP13+as.factor(CP11)+as.factor(CP12)+CP14+CP1+CP2+CP3+CP6+strata(match)

CRITICAL: The example formula above is adjusted for age, sex, and other clinical parameters (CPs). Users should define their own formula according to their variables.

* 8. Logistic regression analysis by the “OR” function.
  + a. Calculate ORs.
> ORresult<-OR(form=form1, meta.d=meta.d, clinic.d=clinic.d, outcome.d=outcome.d)
  + b. Save the results to the “OR.csv” file (Figure 4).
> write.csv(ORresult,"OR.csv")

Note: To validate the reliability of the identified metabolites associated with disease onset, internal validation tests and random sampling are usually performed as sensitivity analyses. In our study, internal validation tests were performed in two randomly split subsets. Moreover, 200 tests were conducted by randomly selecting 80% of all samples as a subset each time. The “split_dataset” function splits the dataset into 2 sub-datasets in a customized proportion; the “random_dataset” function generates random sampling datasets; and the “random_OR” function performs conditional logistic analysis on the random sampling datasets. Figure 4 shows the workflow of conditional logistic regression and validation.

* 9. Validate ORs in two randomly split subsets.
  + a. Generate two randomly split subsets as dataset1 and dataset2.
> split.d<-split_dataset(n.seed=1, proportion.subdataset1 = 0.5)

Note: We use 50% of participants in each subset as an example (proportion.subdataset1 = 0.5). Users can define the size of the sub-datasets by setting the parameter “proportion.subdataset1”.

  + b. Calculate ORs for dataset1 and dataset2.
> t_ORresult<-OR(form=form1,meta.d=split.d[[1]],clinic.d=split.d[[2]],outcome.d=split.d[[3]])
> v_ORresult<-OR(form=form1,meta.d=split.d[[4]],clinic.d=split.d[[5]],outcome.d=split.d[[6]])
  + c. Export the results of the 2 subsets to “dataset1_OR.csv” and “dataset2_OR.csv” (Figure 4).
> write.csv(t_ORresult,"dataset1_OR.csv")
> write.csv(v_ORresult,"dataset2_OR.csv")
* 10. Validate ORs in random sampling datasets.
  + a. Calculate and summarize the ORs of the random sampling datasets.
> random_ORresult<-random_OR(times.random=200, percentage=0.8)

Note: We use 200 tests on 80% randomly selected samples as an example (times.random = 200, percentage = 0.8). Users can define the number of samplings and the size of each sampling by setting the parameters “times.random” and “percentage”.

  + b. Save the results to the “randomdataset_OR.csv” file.
> write.csv(random_ORresult,"randomdataset_OR.csv")

Note: In the “randomdataset_OR.csv” file, the number of significant ORs (both p < 0.05 and FDR < 0.05) is counted to assess the repeatability of the logistic analysis (Figure 4).

Figure 4. Workflow and results of conditional logistic analysis
Red faces represent cases in the study and black faces the matched controls. ORs were calculated in the full dataset and validated in both sub-datasets and random sampling datasets (OR.csv). Sub-datasets were created by splitting the dataset into two independent subsets (dataset1_OR.csv and dataset2_OR.csv). Random sampling datasets were generated by randomly selecting samples of a given size from the full dataset multiple times (randomdataset_OR.csv).

Results interpretation
p values were adjusted for the false discovery rate (FDR) using the Benjamini-Hochberg method. ORs with FDR < 0.05 are considered significant.
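To make the underlying model concrete, below is a minimal sketch of the conditional logistic fit behind the “OR” function, using survival::clogit for a single metabolite. This is not the authors’ “OR” implementation; “M1” is a placeholder column name, age and sex are assumed to be columns of clinic.d, and the metabolite is SD-scaled so the OR is per SD increment:

> # Minimal sketch: conditional logistic regression for one metabolite,
> # stratified on the case-control matching variable.
> library(survival)
> d<-data.frame(case=outcome.d$case, match=outcome.d$match,
               meta=as.numeric(scale(meta.d[,"M1"])), clinic.d)
> fit<-clogit(case~meta+age+as.factor(sex)+strata(match), data=d)
> summary(fit)$conf.int["meta",]   # OR per SD with its 95% CI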
Step 5: Causal mediation analysis

Timing: 9 h

The preceding Spearman correlation analysis (step 3) and logistic regression analysis (step 4) identify significant metabolite-disease and metabolite-clinical risk factor associations. However, it remains unclear which clinical risk factors could modify the association between metabolite levels and disease risk. Causal mediation analysis can provide a causal interpretation of the effect of metabolites on disease onset. To calculate the total, direct, and indirect effects of an exposure on an outcome, causal mediation analysis compares two regression models, one conditioned on the mediator and one not (Figure 5A). The following 5 steps are included: 1) examine the association between exposure and mediator (path exposure to mediator, estimate a in Figure 5A); 2) examine the association of the mediator, as an independent variable, with the outcome (path mediator to outcome, estimate b in Figure 5A); 3) assess the simple total effect of exposure on outcome (path exposure to outcome, estimate c in Figure 5A); 4) assess the direct effect of exposure on outcome with the mediator controlled (path exposure to outcome not through the mediator, estimate c’ in Figure 5A); and 5) calculate the indirect effect of exposure on outcome through the mediator (path exposure to outcome through the mediator, estimate a×b in Figure 5A).^5 Note that total effect = direct effect + indirect effect; in other words, c = c’ + ab.

Figure 5. Causal mediation analysis
(A) Mediation model.
(B) Results of causal mediation analysis (causal mediation.csv).

The R package “mediation” and the “CM” function we provide are required for causal mediation analysis.

* 11. Define the formulas for causal mediation analysis.

Note: Three models are required for causal mediation analysis: Y = cX + e1, M = aX + e2, and Y = c’X + bM + e3, where Y is the outcome, X the exposure, M the mediator, c the total effect of the metabolites, c’ the direct effect of the metabolites, and ab the indirect effect of the metabolites mediated by the risk factor. Here we give example formulas for performing causal mediation analysis on the path metabolites (exposure)-clinical risk factors (mediator)-disease (outcome). This helps us to better understand the potential biological roles of metabolites by assessing their indirect effects on disease operating through clinical risk factors. Users should define their own formulas based on their hypothesized exposure, mediator, and outcome.

  + a. Define the formula M = aX + e2.
> form.a<-clinic~meta
  + b. Define the formula Y = c'X + bM + e3.
> form.b<-case~meta+clinic
  + c. Define the formula Y = cX + e1.
> form.c<-case~meta
* 12. Perform causal mediation analysis by the “CM” function.
> CMresult<-CM(form.a, form.b, form.c)
* 13. Write the results to the “causal mediation.csv” file.
> write.csv(CMresult,"causal mediation.csv", row.names=F)

Results interpretation
The following criteria need to be satisfied for a variable to be considered a mediator (Figure 5B): 1) the exposure variable significantly affects the mediator (beta and p values of estimate a in the “causal mediation.csv” file); 2) there is a significant relationship between the mediator and the outcome (beta and p values of estimate b in the “causal mediation.csv” file); and 3) the exposure variable significantly affects the outcome (beta and p values of estimate c in the “causal mediation.csv” file). The direct effect (ADE) and its corresponding p values are also given in the “causal mediation.csv” file. ACME (average causal mediation effect) represents the indirect effect, and P[acme] < 0.05 is defined as a significant causal mediation relationship. The mediation proportion (prop in the “causal mediation.csv” file) is used to assess how much of the effect of metabolites on disease is mediated through the clinical measurements. Generally, a mediated proportion >10% is recognized as a substantial mediation effect.
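For orientation, here is a minimal sketch of the model pair behind the “CM” function, using mediation::mediate on one exposure-mediator-outcome triplet. This is not the authors’ “CM” implementation; “M1” and “CP1” are placeholder column names:

> # Minimal sketch: fit the mediator model (M = aX + e2) and the outcome model
> # (Y = c'X + bM + e3), then let mediate() combine them.
> library(mediation)
> d<-data.frame(case=outcome.d$case, meta=meta.d[,"M1"], clinic=clinic.d[,"CP1"])
> med.fit<-lm(clinic~meta, data=d)                          # M = aX + e2
> out.fit<-glm(case~meta+clinic, family=binomial, data=d)   # Y = c'X + bM + e3
> med.res<-mediate(med.fit, out.fit, treat="meta", mediator="clinic", sims=1000)
> summary(med.res)   # reports ACME, ADE, total effect, and proportion mediated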
Step 6: Variance partitioning analysis

Timing: 10 min

Variance partitioning is a statistical method to quantify the contribution of multiple sources of variation and to decouple within- and between-individual variation. Here we assessed the variance in metabolite levels explained by each clinical measurement at baseline using variance partitioning. The R package “variancePartition” and the “VP” function are required for the variance partitioning calculation.

* 14. Define the formula for variance partitioning analysis.
> form1<-meta~age+sex+CP8+CP9+CP10+CP13+CP11+CP12+CP14+CP1+CP2+CP3+CP6

Note: All clinical variables that might relate to metabolite levels are included here. Users should define their own formula according to the clinical variables of interest.

* 15. Perform variance partitioning analysis by the “VP” function.
  + a. Calculate the variance contribution and correlation type.
> VPresult<-VP(form1)
> vars<-VPresult[[1]]
> corr_type<-VPresult[[2]]
  + b. Save the results of the variance contribution and correlation type to “VP_var.csv” (Figure 6B) and “VP_corrtype.csv” (Figure 6C).
> write.csv(vars,"VP_var.csv")
> write.csv(corr_type,"VP_corrtype.csv")
* 16. Find the biggest variance.
  + a. For each metabolite, extract the variable that explains the most of the total variation.
> name_max=array(NA,c(nrow(vars),1))
> var_max=array(NA,c(nrow(vars),1))
> corr_type_max=array(NA,c(nrow(vars),1))
> for(i in 1:nrow(vars)){
    k<-which.max(vars[i,])
    name_max[i,1]=colnames(vars)[k]
    var_max[i,1]=vars[i,k]
    corr_type_max[i,1]=corr_type[i,k]
  }
> rownames(name_max)<-rownames(var_max)<-rownames(corr_type_max)<-rownames(vars)
> colnames(name_max)="name"
> colnames(var_max)="var"
> colnames(corr_type_max)="corr_type"
  + b. Visualize the maximum variance contribution for each metabolite as a bar plot (Figure 6A).
> pdf("VPplot.pdf")
> VPplot(var=var_max,corr_type=corr_type_max)
> dev.off()

Figure 6. Results of variance partitioning analysis
(A) Bar plot of the variance that explains the most of the total variation of metabolite levels (VPplot.pdf).
(B) Percentage of variance contribution for metabolites (VP_var.csv).
(C) Sign of beta in linear regression of clinical parameters on metabolites (VP_corrtype.csv).

Results interpretation
A variance explaining over 2% of the total variation is recognized as prominent. If a metabolite’s variance can rarely be explained by traditional clinical measures, that metabolite might be meaningful as a new prediction factor, since it can hardly be replaced by traditional clinical measures.
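As background, here is a minimal sketch of the variancePartition call that a wrapper like “VP” might build on. This is not the authors’ “VP” implementation; note that variancePartition expects features (here, metabolites) in rows and samples in columns, plus a right-hand-side-only formula whose covariates come from clinic.d (the covariates shown are examples):

> # Minimal sketch: fraction of each metabolite's variance explained by
> # selected clinical covariates.
> library(variancePartition)
> vp<-fitExtractVarPartModel(t(meta.d), ~age+CP1+CP2, clinic.d)
> plotVarPart(vp)   # violin plot of the variance fraction explained per covariate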
Expected outcomes

13 files, comprising 10 .csv files and 3 plots, are expected from the above 6 steps; details are listed below. All example files are available on Zenodo at https://doi.org/10.5281/zenodo.7601709.

Analysis | Outcomes | Example file name
Data distribution test | Results of the Shapiro-Wilk test, histogram, and Q-Q plot for the data normality test | normality.test.csv (Figures 2A and 2B); normality.test.pdf (Figures 2C and 2D)
Spearman correlation analysis | Spearman correlation coefficients between clinical parameters and metabolites | SCmatrix.csv (Figure 3A); SCpmatrix.csv (Figure 3B)
 | Visualization | SC.pdf (Figure 3C)
Conditional logistic regression analysis | ORs (95% CI) of disease risk per SD increment for each metabolite | OR.csv (Figure 4)
 | OR validation in 2 sub-datasets | dataset1_OR.csv (Figure 4); dataset2_OR.csv (Figure 4)
 | OR validation in random sampling datasets | randomdataset_OR.csv (Figure 4)
Causal mediation analysis | Causal mediation analysis on metabolites-clinical risk factors-disease | causal mediation.csv (Figure 5B)
Variance partitioning analysis | Variance contribution of clinical risk factors to metabolites | VP_var.csv (Figure 6B); VP_corrtype.csv (Figure 6C)
 | Visualization | VPplot.pdf (Figure 6A)

Limitations

This protocol provides a useful strategy for processing LC-MS-based metabolomics data from prospective nested case-control studies in R. The aim of this protocol is to deeply integrate metabolome profiling and clinical phenotypes and to explore the relationship between clinical risk factors and metabolomic pathways in the development and progression of human diseases. In addition to the general analysis strategies, we provide customization parameters wherever possible to increase the adaptability of this protocol. However, there are still some limitations.

First, metabolic pathway enrichment analysis is a popular method for metabolomic profiling data mining and visualization, and it is helpful for exploring the critical pathways and biological significance of diseases. In our study, 50 target metabolites in particular pathways (amino acids and microbiota-related metabolites) were detected,^1 so it is unnecessary and improper to perform pathway enrichment in this case, and we did not provide a pathway enrichment method in this paper. However, pathway enrichment analysis is highly recommended for researchers performing untargeted metabolomic detection, which has no selection bias of metabolites. MetaboAnalyst (www.metaboanalyst.ca) is a free web-based tool that provides approaches to metabolic pathway analysis and enrichment analysis.^6

Second, in logistic regression analysis, a validation study is an important way to understand and mitigate information bias.^7 In step 4, our internal validation was conducted by sampling at random. However, many other kinds of validation methods could be considered, and users may want to design their validation study differently, for example, sampling based on a true gold-standard measure or stratifying by specific variables. The code available here performs conditional logistic regression only on validation sets sampled at random. For generality, we provide the “OR” function for customized datasets, with which users can conduct logistic regression analysis on validation data of their own.
Third, in causal mediation analysis, bidirectional mediation analysis in both directions (direction 1: clinical risk factors-metabolites-disease; direction 2: clinical risk factors-disease-metabolites) could provide more robust evidence than one-directional causal mediation analysis to support the hypothesis that metabolites causally impact the onset of disease, or vice versa. However, bidirectional mediation analysis in direction 2 demands metabolomics data from a follow-up visit after disease occurs. Because only baseline metabolites were measured in our prospective study, we only performed causal mediation analysis in direction 1. Users who need to conduct bidirectional mediation analysis can modify the formulas according to their study design.

Finally, variancePartition divides the contribution over multiple variables that are strongly correlated; therefore, including more correlated variables in the model will reduce the variance attributed to individual factors. It is useful to include all variables in a first analysis and then drop variables that have minimal effect. Considering that a personalized selection of variables according to the individual research focus may be more appropriate, we did not provide a workflow for variable dropping in this paper.

Troubleshooting

Problem 1
Package installation failures might appear when importing the “functions.R” file in step 1 (datasets and codes import), as follows: 1) “package is not available” and 2) “failed to lock directory”.

Potential solution
* For error 1, “package is not available”: R packages are primarily distributed as source packages, but binary packages (a packaging up of the installed package) are also supported. Try to install the binary package instead:
> install.packages(pkgs=package, type="binary")
* Error 2, “failed to lock directory”, arises because the library directory is ‘locked’, which prevents any other process from installing into that library and preserves the previous version of the package. The solution is to remove the lock directory of this package manually.

Problem 2
The error “cannot open file: No such file or directory” occurs when the required files (“functions.R” or the input datasets) cannot be found in the R working directory in step 1 (datasets and codes import).

Potential solution
* Make sure that the following files are in the same folder: “functions.R”, “metabolites.txt”, “clinic.txt”, and “outcome.txt”. File paths in Chinese or other languages may not be recognized by RStudio, so use an English-language path when running RStudio. Set the R working directory to the correct location by running the following command in R.
> setwd(dir="target directory")
* If the source code and example files were downloaded from Zenodo, please unzip the “example inputs” file and extract all 3 example input files to the same location as “functions.R”.

Problem 3
A warning of conversion failure might appear when generating a pdf file containing a plot in step 3 (Spearman correlation analysis, SC.pdf) and step 6 (variance partitioning analysis, VPplot.pdf).

Potential solution
The RStudio IDE uses UTF-8 as its default encoding. If a plot contains Greek characters such as α or β, a conversion failure occurs. The solution is to change the encoding type of the pdf file to “ISOLatin2.enc” by running the command below.
> pdf(file=filename, encoding="ISOLatin2.enc")

Problem 4
There may be no response from the conditional logistic regression analysis in step 4 (logistic regression analysis).
The “OR” function performs only conditional logistic regression, which requires case-control matched ID information from a case-control study. If no matching information is provided, no result will be generated when calling the “OR” function.

Potential solution
Interrupt R to stop the running “OR” function and check the matched IDs in 2 steps. First, make sure the case-control matched IDs are included in the “match” column of the “outcome.txt” file. Second, make sure the matched IDs are assigned in the model for the OR calculation, as in “case ~ meta + strata(match)”.

Problem 5
In step 4 (logistic regression analysis), a zero is produced in the OR results (OR.csv) if any error occurs in the OR calculation, by the design of our “OR” function.

Potential solution
Missing values in the metabolite data will throw an error; therefore, missing metabolite values need to be imputed. Metabolites with ≥20% missingness should be excluded, and imputation with 50% of the lowest measured value or the limit of detection is recommended.

Resource availability

Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Jieli Lu (ljl11319@rjh.com.cn).

Materials availability
This study did not generate new unique reagents.

Acknowledgments