Abstract

   Non-alcoholic fatty liver disease (NAFLD) is a chronic liver disease
   that presents a great challenge for treatment and prevention.. This
   study aims to implement a machine learning approach that employs such
   datasets to identify potential biomarker targets. We developed a
   pipeline to identify potential biomarkers for NAFLD that includes five
   major processes, namely, a pre-processing step, a feature selection and
   a generation of a random forest model and, finally, a downstream
   feature analysis and a provision of a potential biological
   interpretation. The pre-processing step includes data normalising and
   variable extraction accompanied by appropriate annotations. A feature
   selection based on a differential gene expression analysis is then
   conducted to identify significant features and then employ them to
   generate a random forest model whose performance is assessed based on a
   receiver operating characteristic curve. Next, the features are
   subjected to a downstream analysis, such as univariate analysis, a
   pathway enrichment analysis, a network analysis and a generation of
   correlation plots, boxplots and heatmaps. Once the results are
   obtained, the biological interpretation and the literature validation
   is conducted over the identified features and results. We applied this
   pipeline to transcriptomics and lipidomic datasets and concluded that
   the C4BPA gene could play a role in the development of NAFLD. The
   activation of the complement pathway, due to the downregulation of the
   C4BPA gene, leads to an increase in triglyceride content, which might
   further render the lipid metabolism. This approach identified the C4BPA
   gene, an inhibitor of the complement pathway, as a potential biomarker
   for the development of NAFLD.

   Keywords: NAFLD, biomarker, machine learning, transcriptomics,
   lipidomics

1. Introduction

   Non-alcoholic fatty liver disease (NAFLD) is a form of chronic liver
   disease that affects 20–30% of the western population and approximately
   25% of the global population [[30]1,[31]2,[32]3] NAFLD is associated
   with a wide range of diseases, including increased visceral obesity and
   metabolomic abnormalities, such as insulin resistance, diabetes,
   hypertension, dyslipidemia, atherosclerosis and systemic
   micro-inflammation [[33]4,[34]5,[35]6,[36]7,[37]8,[38]9]. Currently,
   enhanced by an inactive lifestyle and unhealthy food culture, the
   spread of NAFLD has increased across countries among different age
   groups [[39]4,[40]10]. The disease has increased from 15% in 2005 to
   25% in 2010 with a subsequent increase in the number of obesity cases
   [[41]11]. It is also anticipated that there will be an increase in the
   number of NAFLD cases from 83.1 million (2015) to 100.9 million (2030)
   [[42]12].

   NAFLD consists of a spectrum of hepatic abnormalities ranging from
   steatosis or non-alcoholic fatty liver to the various levels of
   necrotic inflammation leading to non-alcoholic steatohepatitis (NASH)
   [[43]13]. The minority of NAFLD cases progresses to liver disease
   complications resulting to 4–8% deaths from cirrhosis complications and
   1–5% deaths from hepatocellular carcinoma [[44]11]. The initial stages
   of NAFLD, characterised by complex pathogenesis, include the
   accumulation of triglycerides in hepatocytes, which might further
   develop to conditions such as inflammation, fibrosis and cellular
   death, which are characteristics of NASH [[45]13]. It is considered
   that there are multiple factors which might lead to NAFLD
   [[46]13,[47]14]. NAFLD risk factors include an unhealthy diet and an
   inactive lifestyle and it is expected that the interaction between the
   genetic characteristics, diet and gut microbiota of an individual play
   a pivotal role in our understanding of the development and progression
   of the disease.

   Understanding the pathogenesis of NAFLD will cater a better
   understanding of the pathophysiological processes underlying the
   disease and will likely highlight potential therapeutic interventions
   [[48]4]. While the molecular mechanism, involved in the addition of
   fats in the liver, is not well understood, certain cytokines obtained
   from inflammation sites, particularly from extrahepatic adipose tissue,
   have been reported to induce this process [[49]15]. Hepatic de novo
   lipogenesis is also known to be a unique feature in steatosis [[50]15].
   Insulin resistance has also been reported to lead to metabolomic
   dysregulation in NAFLD that activates and aggravates hepatic steatosis
   [[51]13]. In total, 20–30% of NAFLD patients with simple steatosis
   progress to NASH [[52]13].

   In this study, we analyse various types of omics datasets, such as
   transcriptomics and lipidomics, in an effort to gain a better
   understanding of the NAFLD’s underlying pathophysiologic processes.
   Initially, we analysed transcriptomics data to identify potential gene
   biomarkers involved in the development of NAFLD and then proceeded to
   analysing lipidomics data so as to identify potential lipid biomarkers,
   as well as the pathways which are perturbed by these biomarkers.

2. Materials and Methods

   The schematic diagram presents the pipeline developed for the biomarker
   [[53]16,[54]17] identification using NAFLD-related transcriptomics and
   lipidomics datasets ([55]Figure 1).

Figure 1.

   [56]Figure 1
   [57]Open in a new tab

   (Created with BioRender.com) NAFLD biomarkers identification study
   design consisting of pre-processing step followed by differential
   expression analysis for feature selection and then univariate and
   multivariate analysis. The final step includes the results
   interpretation and pathway analysis.

2.1. Transcriptomics

2.1.1. Data Acquisition

   The datasets, [58]GSE151158, [59]GSE58979, [60]GSE63067, [61]GSE89632
   and [62]GSE33814, employed in this study were downloaded from the Gene
   Expression Omnibus (GEO) repository on 21 January 2021. In total, these
   datasets consisted of 146 samples, 81 of which were steatosis-related
   and 65 were control. The data were split into training, testing and
   validation sets and were subjected to pre-processing, normalization,
   data integration, batch-effect correction, PCA analysis, differential
   gene expression analysis, identification of common significant genes,
   as well as supervised analysis using random forest and biological
   interpretation.

   Each GEO dataset was downloaded and loaded into R (version 4.0.3) by
   using the getGEO function in the GEOquery package (version 2.58.0)
   [[63]18]. All datasets, apart from [64]GSE151158, were already
   normalised. [65]GSE151158 was normalised using the edgeR package
   (version 3.32.1) [[66]19] cpm function with a True log parameter.

2.1.2. Derivation of New Transcriptomics Cohort from Multiple GEO Datasets

   Due to the number of control and steatosis samples, in each GEO
   dataset, being low ([67]Table 1), [68]GSE151158, [69]GSE58979,
   [70]GSE63067 and [71]GSE89632 were integrated to derive a
   transcriptomics cohort that can be used for the downstream analysis,
   while [72]GSE33814 was kept for validation. Following the datasets’
   integration, in the derived cohorts, batch effects were identified
   using PCA (principle component analysis) plots. The PCs were generated
   using the function prcomp in the stats package and the PCA plots were
   visualized using ggbiplot (version 0.55) [[73]20]. The batch correction
   for the derived cohort was performed based on non-parametric adjustment
   using ComBat [[74]21], where batch effects due to different sequencing
   platforms were corrected. Following batch correction, PCA was performed
   to cater their visualisation.

Table 1.

   Details of the transcriptomics data used in the study.
   S.No GEO Number of Samples Number of Features Platform Reference
   1 [75]GSE89632 Control (n = 24) vs. Steatosis (n = 20) 29,377 Illumina
   HumanHT-12 WG-DASL V4.0 R2 expression beadchip [[76]22]
   2 [77]GSE151158 Control (n = 21) vs. Steatosis (n =23) 618 NanoString
   Human Immunology v2 Code Set
   (NS_Immunology_v2_C2328+PLS_Golden_1_C5164) [[78]23]
   3 [79]GSE58979 Control (n = 0) vs. Steatosis (n = 17) 49,395 Affymetrix
   Human Gene Expression Array [[80]24]
   4 [81]GSE63067 Control (n = 7) vs. Steatosis (n = 2) 54,675
   [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array [[82]25]
   5 [83]GSE33814 Control (n = 13) vs. Steatosis (n = 19) 48,803 Illumina
   HumanWG-6 v3.0 expression beadchip [[84]26]
   [85]Open in a new tab

2.1.3. Differential Gene Expression (DGE) Analysis

   The differential gene expression (DGE) analysis was performed using the
   lmFit and eBayes functions, available within the limma package (version
   3.46.0) [[86]27]. The application of the Benjamini–Hochberg (BH)
   correction method yielded a gene table consisting of the log fold
   change (logFC) and the adjusted p value. Significant genes with an
   adjusted p value less than 0.05 were then extracted. This gene list was
   further filtered to only include genes which were common between the
   derived cohort and validation set.

2.1.4. Random Forest-Based Predictions

   The derived cohort was split into a testing (control, 16; steatosis,
   27) and a training set (control, 36; steatosis, 64). The validation set
   consisted of 13 control and 19 steatosis samples.

   The random forest was set for repeated 10-fold cross validation with 5
   repeats. The parameters of the random forest were tuned by the
   expand.grid method for the factors mtry, (12), ntree (55) and maxnode
   (6) using the training data and train function from the caret package
   (version 6.0.86) [[87]28]. The model was then tested and validated
   using the test and validation set.

   The accuracy of the model in classifying the steatosis samples for the
   testing and validation sets was evaluated based on the receiver
   operating characteristics (ROC) analysis. The area under the curve
   (AUC), sensitivity and specificity for both datasets were calculated by
   using the pROC package (version 1.17.0.1) [[88]29].

2.1.5. Downstream Data Analysis

   The Wilcoxon test was conducted on the training and validation datasets
   to investigate the genes which were upregulated and downregulated
   between the control and the steatosis. A further pathway and GO
   enrichment analysis was performed using enrichR (version 3.0) [[89]30].
   The correlation plot for the genes in the training dataset was plotted
   using the package corrplot (version 0.84) [[90]31].

2.2. Lipidomics Data Analysis

2.2.1. Data Acquisition

   Two cohorts (the Fenland cohort and the Italian cohort) were collected
   from Sanders et al., for identifying the biomarkers for NAFLD [[91]32].
   Both the cohorts consisted of clinical data and lipidomics data.

2.2.2. Data Pre-Processing

   The dataset was loaded into R (version 4.0.3) and was separated into
   clinical data and lipid expression data. Each of the expression data
   row was annotated according to the lipid names. The dataset was then
   scaled and the values of the missing features were imputed according to
   the feature mean.

2.2.3. Differential Lipid Expression Analysis

   A differential lipid expression analysis was performed on the Italian
   cohort using the lmFit and eBayes functions present in the limma
   package (version 3.46.0) [[92]27]. The sample number difference, for
   each sample type, resulted in a class imbalance for the Italian cohort
   where there were 120 samples in steatosis0 and 21 samples in
   steatosis1. To address this, the steatosis0 samples were separated into
   6 different batches, each containing 20 steatosis0 samples, ensuring
   that the samples in each of the 6 batches were unique and not repeated
   in other batches. The topTable function in limma obtained
   Benjamini–Hochberg (BH)-corrected p values and the logFC change of
   significant lipids between the steatosis0 and steatosis1 samples. A
   Volcano plot was plotted using the ggplot2 (version 3.3.3) [[93]33] for
   all the 6 different batches to visualize the differentially expressed
   lipids. The common significant lipids among these 6 different batches
   were obtained and were further subjected to a differential expression
   analysis using the 120 steatosis0 and 21 steatosis1 samples to obtain
   their logFC and p values.

2.2.4. Random Forest and ROC Curve Analysis

   The first random forest model was formed using the Italian cohort as
   the training set and the Fenland data as the test set. A stratified k
   fold cross validation approach was implemented, where the fold value
   was set to 5. The parameters of the random forest were tuned by using
   expand.grid for the factors mtry (12), ntree (150) and maxnode (6)
   using the training data and train function from caret package (version
   6.0.86) [[94]28]. The model was then tested and validated using the
   test and validation set.

   Then, a second random forest model was generated using the Fenland data
   as the training set and the Italian cohort as the test set. A
   stratified k fold cross validation approach was implemented, where the
   fold value was set to 5. The parameters of the random forest were tuned
   by expand.grid for the factors mtry (2), ntree (55) and maxnode (6)
   using the training data and train function from caret package (version
   6.0.86) [[95]28]. The model was then tested and validated using the
   test and validation set.

   The AUC, sensitivity and specificity for both the models was calculated
   by using the pROC package (version 1.17.0.1) [[96]29].

2.3. Statistical Analysis

   A Wilcoxon test was conducted using the rstatix package (version 0.7.0)
   [[97]34]. Since the samples were unpaired, the wilcoxon test paired
   option was set to false and the confidence level was set to 0.95. The p
   values from this test, for each of the significant features, were
   extracted and stored in a new data frame. A boxplot was then generated
   to understand the significant difference between the steatosis0 and
   steatosis1 samples.

2.3.1. Heatmap-Based Visualization of Significant Lipid Features

   Two heatmap were created using the ComplexHeatmap (version 2.6.2)
   [[98]35] package using 7 significant lipids. Heatmap 1 consisted of the
   120 steatosis0 samples, while heatmap 2 was formed using the 21
   steatosis1 samples. These two heatmaps were further combined to
   construct the complex heatmap that represented both the steatosis0 and
   steatosis1 samples variation across the lipids. The row title of the
   heatmap was set to the lipid names and the columns of the heatmap
   represents the steatosis0 and steatosis1 sample IDs.

2.3.2. LIPEA-Based Lipid Pathway Enrichment Analysis

   The LIPEA lipid pathway enrichment analysis [[99]36] employs the lipid
   compounds IDs contained in the KEGG Database (Kyto Encyclopedia of
   Genes and Genomes) and identifies significantly disrupted pathways by
   applying a Fisher’s exact test followed by an over representation
   analysis (ORA) for each pathway; an output table, consisting of the
   enriched pathways, the lipids involved in them and their p values, is
   then generated.

2.3.3. Lipid Network Analysis

   A network analysis was performed to obtain potential interactions
   between the various lipid classes. The R qgraph package (version 1.6.9)
   [[100]37] was used to generate the network, using the 7 significant
   lipids, with nodes, representing lipids, connected to weighted edges
   resembling the interaction between them. A Benjamini–Hochberg
   correction was implemented and the significant threshold was set to
   0.05.

3. Results

3.1. Feature Selection for Random Forest Model

3.1.1. Gene Signature Identification

   By merging [101]GSE151158, [102]GSE58979, [103]GSE63067 and
   [104]GSE89632, new NAFLD transcriptomics data were derived. [105]Figure
   2A depicts the PCA (principle component analysis) for the newly derived
   cohort before and after batch correction. To identify significant
   genes, a differential gene expression analysis was conducted over the
   derived transcriptomics datasets. In the derived transcriptomics cohort
   training data ([106]GSE151158, [107]GSE58979, [108]GSE63067 and
   [109]GSE89632), 173 genes were identified as significant (126 were
   upregulated and 47 were downregulated). Within the validation set
   ([110]GSE33814), there were 1971 significant genes (772 were
   upregulated and 1199 were downregulated). Between the training and the
   validation sets, there were 18 common significant genes (C9, HPRT1,
   TLR1, B2M, BAX, GAPDH, BTK, PTPN6, SERPING1, ITGAE, IL1RAP, MSR1,
   TNFRSF14, IL15, CX3CR1, TOLLIP, IFIH1 and C4BPA) forming the group that
   was used within the training and validation sets.

Figure 2.

   [111]Figure 2
   [112]Open in a new tab

   (A) A PCA plot that shows how the datasets behaved before and after the
   batch correction was implemented using ComBat. (B) The AUC values of
   the test and validation sets of the random forest model.

3.1.2. Lipid Signature Identification

   There were two lipid datasets, the Italian cohort and the Fenland
   cohort. Within the Italian cohort, 120 steatosis0 and 21 steatosis1
   samples were present, indicating a class imbalance. The steatosis0
   samples were separated into six batches and a differential expression
   analysis was conducted six different times using limma. A
   Benjamini–Hochberg (BH) correction was implemented on the differential
   lipids and the lipids with adjusted p value lesser than 0.05 were
   extracted. The number of significant lipids identified for the
   different batches were as follows: batch 1, 191; batch 2, 189; batch 3,
   171; batch 4, 129; batch 5, 42; batch 6, 57. There were 11 significant
   lipids which were identified in all six batches
   (Cholesterol,CE(16:0),DG(34:1),DG(36:2),TG(52:2),TG(52:3),TG(53:2),TG(5
   3:3), TG(53:6),TG(53:7),TG(54:2)) and, of those 11 significant lipids,
   2 lipids (Cholesterol, CE(16:0)) were upregulated and the remaining 9
   lipids were downregulated (DG(34:1),
   DG(36:2),TG(52:2),TG(52:3),TG(53:2),TG(53:3),TG(53:6),TG(53:7) and
   TG(54:2)) in the steatosis0 vs. steatosis1 samples.

3.2. Random Forest Model Performance

3.2.1. Transcriptomic Features Analysis

   The area under the receiver operating characteristic curve (AUC) was
   calculated for the random forest model to be 0.91. The prediction
   accuracy decreased within the validation data, with an AUC value of
   0.73 ([113]Figure 2B). We hypothesise that this prediction accuracy
   reduction might be a result of the reduced sample number.

3.2.2. Lipidomic Features Analysis

   There were seven (Cholesterol, CE(16:0), DG(36:2), TG(52:2), TG(52:3),
   TG(53:2) and TG(54:2)) lipids common to the and Italian and Fenland
   cohorts. These lipids were part of the first random forest model. The
   Italian cohort was split into training and testing sets and the Fenland
   cohort was kept as the validation set. There were 36 steatosis0 and 6
   steatosis1 samples in the test set and 633 steatosis0 and 222
   steatosis1 in the validation set.

   The resulting AUC value of random forest 1 for the test dataset is
   0.63. An accuracy increase was reported for the validation dataset,
   with an AUC value of 0.67 ([114]Figure 3A).

Figure 3.

   [115]Figure 3
   [116]Open in a new tab

   (A) Random forest 1 ROC curve that was formed using the Fenland cohort
   as the validation set. (B) Random forest 2 ROC curve that was formed
   using the Italian cohort as the validation set.

   The process was repeated by alternating the Italian cohort as the
   validation set and the Fenland cohort as the training and testing set.
   There were 190 steatosis0 and 67 steatosis1 samples in the test set and
   120 steatosis0 and 21 steatosis1 in the validation set.

   Once the parameters were tuned by using the training set, the second
   random forest model was validated using the test and validation set.
   The AUC values for the test and validation datasets are 0.74 and 0.72,
   respectively ([117]Figure 3B).

3.3. Downstream Analysis

3.3.1. Transcriptomic Feature Study

   A pairwise correlation among 18 genes, within the training dataset, was
   identified within the control and the steatosis samples ([118]Figure
   4). The colour determines the sign of the coefficient, where the red
   colour represents a positive effect and the blue colour indicates a
   negative one ([119]Figure 4A,B). The intensity of the colour increases
   proportionally to the magnitude of the correlation coefficient among
   the genes. When compared to the gene correlation matrix in the
   steatosis and control samples within the training data, the IFIH1 gene
   is positively correlated to BTK in the steatosis samples.

Figure 4.

   [120]Figure 4
   [121]Open in a new tab

   (A) A correlation gene matrixof control samples within the training
   dataset. (B) A correlation matrix of steatosis samples within the
   training dataset.

   A pathway enrichment analysis of the transcriptomics data identified
   the following pathways: complement and coagulation cascades (C9,
   SERPING1 and C4BPA), cytokine-cytokine receptor interaction (CX3CR1,
   IL15, TNFRSF14 and IL1RAP), herpes simplex virus 1 infection (IFIH1,
   BAX, TNFRSF14 and B2M), B cell receptor signaling pathway (BTK and
   PTPN6), Toll-like receptor signaling pathway (TLR1 and TOLLIP),
   JAK–STAT pathway (IL15 and PTPN6) and Human immunodeficiency virus 1
   infection and primary immunodeficiency pathways (BAX and B2M)
   ([122]Figure 5). Among the 18 genes within the training and the
   validation sets, 5 genes (HPRT1, C9, C4BPA, IL1RAP and TNFRSF14) were
   upregulated in both training and validation sets, whereas the other 13
   genes were inconsistent between the training and the validation
   samples. Five genes, namely, IL1RAP, TOLLIP, HPRT1, C9 and C4BPA, have
   been previously identified to be in relation with NAFLD or other
   inflammatory responses [[123]38,[124]39]. A Wilcoxon test on C9 and
   C4BPA genes revealed gene downregulation within the steatosis samples
   ([125]Figure 6 and [126]Figure 7).

Figure 5.

   [127]Figure 5
   [128]Open in a new tab

   Enrichment analysis of gene signatures. Legend represents the p values
   associated with the pathways.

Figure 6.

   [129]Figure 6
   [130]Open in a new tab

   A boxplot denoting the downregulation of the C9 gene in the steatosis
   samples.

Figure 7.

   [131]Figure 7
   [132]Open in a new tab

   A boxplot showing the downregulation of the C4BPA gene in the steatosis
   samples.

3.3.2. Lipidomics Feature Study

   The seven significant lipids, employed in the random forest model, were
   subjected to a Wilcoxon nonparametric statistical test and their
   corresponding p values were plotted. Among the seven lipids, TG (52.3)
   had the highest −log10 (p value). Further boxplots were generated based
   on these lipids, revealing that triglycerides were upregulated in the
   steatosis1 samples and downregulated in the steatosis0 samples
   ([133]Figure 8).

Figure 8.

   [134]Figure 8
   [135]Open in a new tab

   A boxplot denoting the upregulation of different lipid classes in
   steatosis1 when compared to steatosis0. The p value obtained from the
   Wilxocon test is significant.

   A heatmap was constructed for the seven lipids across the steatosis0
   and steatosis1 samples. The heatmap colours represent the coefficient
   signs and, more specifically, the red colour represents a positive
   effect and the blue colour indicates a negative effect. The majority of
   the steatosis0 samples exhibit a negative correlation effect on the
   triglycerides, whereas the steatosis1 samples had a positive
   correlation, supporting the notion that triglyceride upregulation could
   indicate NAFLD development ([136]Figure 9). The lipid network shows
   that most of the triglycerides were positively correlated with one
   another ([137]Figure 10).

Figure 9.

   [138]Figure 9
   [139]Open in a new tab

   A complex heatmap consisting of steatosis0 and steatosis1 samples.
   Colours depict whether the lipids are positively or negatively
   correlated to the sample.

Figure 10.

   [140]Figure 10
   [141]Open in a new tab

   A lipid network showing the interactions among various groups. The blue
   colour indicates a positive correlation, while the red colour indicates
   a negative correlation.

   A pathway enrichment analysis of the lipids revealed long-term
   depression, lipolysis in adipocytes, glycerolipid metabolism and
   insulin resistance as the most significant pathways in which those
   lipids are involved in ([142]Table 2) (38).

Table 2.

   Pathway enrichment analysis of the significant lipids and their p
   values.
   Pathway Name Pathway Lipids Converted Lipids (Number) Converted Lipids
   (Percentage) Converted Lipids (List) p-Value
   Long-term depression 3 2 50.00 [143]C00165, [144]C00641 0.0147783
   Regulation of lipolysis in adipocytes 6 2 50.00 [145]C00165,
   [146]C00422 0.0147783
   Glycerolipid metabolism 15 2 50.00 [147]C00422, [148]C00641 0.0421456
   Insulin resistance 4 2 50.00 [149]C00165, [150]C00422 0.0421456
   Fat digestion and absorption 8 2 50.00 [151]C00165, [152]C00422
   0.0800387
   Rap1 signaling pathway 1 1 25.00 [153]C00165 0.137931
   Chemokines signaling pathway 2 1 25.00 [154]C00165 0.137931
   Ras signaling pathway 2 1 25.00 [155]C00165 0.137931
   MAPK signaling pathway 1 1 25.00 [156]C00165 0.137931
   NF-kappa B signaling pathway 1 1 25.00 [157]C00165 0.137931
   [158]Open in a new tab

   Abbreviations: NAFLD, non-alcoholic fatty liver disease; ROC, receiver
   operating characteristics; PCA, principal component analysis; LIPEA,
   lipid pathway enrichment analysis; NASH, non-alcoholic steatohepatitis;
   GEO, Gene Expression Omnibus.

4. Discussion

   The NAFLD pathogenesis is complex and unhealthy lifestyle trends have
   substantially increased its health burden over the past few years
   [[159]11,[160]13]. Omics integrative analytics have been proposed as a
   promising approach to gain a better understanding of NAFLD’s biological
   underpinnings [[161]40]. In this study, we analysed publicly available
   transcriptomic datasets to identify potential novel gene biomarkers, as
   well as the pathways which are perturbed.

   The biomarkers identified by our transcriptomics analysis are primarily
   involved in immune-related pathways. Previous research studies have
   shown that NAFLD is related to an excessive activation of the immune
   system [[162]41]. C4BPA, one of the genes identified by our
   transcriptomics analysis of the NAFLD samples, is primarily involved in
   immune-related pathways and has been identified as a target by several
   disease studies [[163]42,[164]43,[165]44]. Research work has been
   conducted to study the defense function of C4BP against Influenza A
   Virus (IAV), an upper respiratory tract infection caused by the
   Influenza virus under the Orthomyxoviridae family which is known to
   cause the pandemic [[166]45]. The complement system is safeguarded by
   various regulatory proteins, C4BP being one such humoral regulator
   example, to avoid unnecessary inflammation events [[167]46].
   Furthermore, C4BP-IgM has also been suggested as a target for the
   treatment of gonorrhea [[168]47]. C9 and C4BPA have been further
   identified as key genes involved in the NAFLD development [[169]38].
   Moreover, IL1RAP and TOLLIP, involved in cytokine–cytokine interaction
   and Toll-like receptor signaling pathways, have been reported to play a
   key role in liver inflammatory diseases [[170]39,[171]48,[172]49].

   The complement system pathway is poorly characterised in NAFLD and NASH
   [[173]50]. It is indicated that the complement system can be activated
   by three different pathways: the classical, the alternative and the
   lectin pathways [[174]50]. The gene biomarker identified in our study
   is present in the complement and coagulation cascade pathway. C4BPA,
   one of the genes that was identified in this pathway, is also known as
   C4BP. It is indicated that C4BP is the main soluble inhibitor of the
   classical and the lectin pathways [[175]51,[176]52,[177]53]. If the
   classical and lectin pathways are activated, they lead to the formation
   of c3 and c5 convertase and c3 and c5 conversions are central reaction
   in the complement activation.

   The activation of the classical and lectin pathways leads to the
   apoptosis of hepatocytes which, in turn, renders the lipid metabolism;
   the improvement in controlling the apoptosis process might help in
   controlling NASH [[178]54].

   It is indicated that there is an increased level of c3 in obese
   individuals and the action of c3 convertase produces c3a and c3b. c3a
   has a short half life but is later converted into desArg c3a which has
   a longer half life [[179]55]. desArg c3a, also known as ASP
   (acylation-stimulating protein), is involved in the increase in
   triglycerides in the plasma by causing ASP resistance. Metabolic
   resistance has also been indicated to be shared between insulin and
   ASP, where the increase in insulin levels might be caused by obesity
   [[180]56].

   Our differential lipid expression analysis identified 11 significant
   lipids. The most common class of those lipids was triglycerides. We
   observed a triglyceride upregulation between the steatosis0 and
   steatosis1, which is in agreement with recent reports of
   hypertriglyceridemia being common in NAFLD patients
   [[181]57,[182]58,[183]59]. Due to the reason of over-nutrition or
   insulin resistance, triglyceride concentration within the liver becomes
   rendered and that might create an increase in the concentration of
   hepatic triglycerides, which leads to steatosis [[184]58]. It is
   essential that the triglycerides are exported from the liver in the
   form of VLDL; if this process is affected, it results in steatosis
   [[185]60]. Furthermore, the network construction has shown the positive
   correlation of triglycerides to the other lipid groups, cholesterol and
   diacylglycerol.

   The pathways associated with the identified significant lipids are
   involved in adipocyte lipolysis. It has been previously reported that
   elevated body mass causes fat cell lipolysis [[186]61]. This might
   further cause adipose tissue inflammation, which contributes to insulin
   resistance. Another function of insulin is to limit lipolysis by
   inhibiting HSL (hormone-sensitive lipase) [[187]62].

   In conclusion, our findings suggest that the downregulation of C4BP
   results in an activation of the lectin pathway in the complement system
   triggering the conversion of c3 to c3a and c3b by the action of c3
   convertase, thereby increasing the triglyceride levels, as shown in our
   study. This indicates that C4BP could be a potential biomarker linked
   to the complement system pathway, that would aid in the treatment of
   NAFLD.

   This study has several limitations. Firstly, a small number of
   transcriptomics samples were used to train and validate the model.
   Then, the dataset merging using ComBat might have led to the loss of
   information. Further studies with large sample sizes should be further
   conducted to validate our findings.

5. Conclusions

   We identified C4BPA, which activates the complement and coagulation
   pathway that renders lipid metabolism, as a potential NAFLD biomarker.

Author Contributions

   All authors have made a substantial, direct intellectual contribution
   to this study. Conceptualization, A.A.; methodology and investigation
   R.S., B.B. and A.A.; writing—original draft preparation, R.S.;
   writing—review and editing, R.S., B.B., G.V.G. and A.A.; supervision,
   G.V.G. and A.A.; project administration, R.S. and A.A. All authors have
   read and agreed to the published version of the manuscript.

Funding

   This work was funded by MRC Health Data Research UK (HDRUK/CFC/01), an
   initiative funded by UK Research and Innovation, Department of Health
   and Social Care (England) and the devolved administrations, and leading
   medical research charities. The views expressed in this publication are
   those of the authors and not necessarily those of the NHS, the National
   Institute for Health Research, the Medical Research Council or the
   Department of Health. AA and GVG also acknowledge support from the NIHR
   Birmingham SRMRC, Nanocommons H2020-EU (731032) and MAESTRIA (Grant
   agreement ID 965286).

Institutional Review Board Statement

   Not applicable.

Informed Consent Statement

   Not applicable.

Data Availability Statement

   The rmarkdown scripts and the html reports are available at:
   [188]https://github.com/Roshanshafiha/NAFLD-Multi-Omics-Data-Analysis,
   accessed on 15 March 2021.

Conflicts of Interest

   The authors declare no conflict of interest.

Footnotes

   Publisher’s Note: MDPI stays neutral with regard to jurisdictional
   claims in published maps and institutional affiliations.

References