Abstract
The effects of coronavirus disease 2019 (COVID-19) primarily concern
the respiratory tract and lungs; however, studies have shown that all
organs are susceptible to infection by severe acute respiratory
syndrome coronavirus 2 (SARS-CoV-2). COVID-19 may involve multiorgan
damage from direct viral invasion through angiotensin-converting enzyme
2 (ACE2), through inflammatory cytokine storms, or through other
secondary pathways. This study involved the analysis of publicly
accessible transcriptome data from the Gene Expression Omnibus (GEO)
database for identifying significant differentially expressed genes
related to COVID-19 and an investigation relating to the pathways
associated with mitochondrial, cardiac, hepatic, and renal toxicity in
COVID-19. Significant differentially expressed genes were identified
and ranked by statistical approaches, and the genes derived by
biological meaning were ranked by feature importance; both were
utilized as machine learning features for verification. Sample set
selection for machine learning was based on the performance, sample
size, imbalanced data state, and overfitting assessment. Machine
learning served as a verification tool by facilitating the testing of
biological hypotheses by incorporating gene list adjustment. A
subsequent in-depth study for gene and pathway network analysis was
conducted to explore whether COVID-19 is associated with cardiac,
hepatic, and renal impairments via mitochondrial infection. The
analysis showed that potential cardiac, hepatic, and renal impairments
in COVID-19 are associated with ACE2, inflammatory cytokine storms, and
mitochondrial pathways, suggesting potential medical interventions for
COVID-19-induced multiorgan damage.
Introduction
Patients with coronavirus disease 2019 (COVID-19) experience various
respiratory issues. Acute respiratory distress syndrome (ARDS) due to
COVID-19 pneumonia is the primary cause of mortality and long-term lung
damage. Although the respiratory system is most commonly affected in
people infected with severe acute respiratory syndrome coronavirus 2
(SARS-CoV-2), the virus can impact any organ in the body [[26]1–[27]3].
Indeed, multiple organs are typically involved in critically ill
patients [[28]4]. In addition to classical symptoms of respiratory
distress, many patients with COVID-19 have systemic symptoms, including
cardiovascular, hepatic or renal failure as well as coagulation
disorders. Studies have reported organ damage involving the lungs (33%
of patients), heart (32%), kidneys (12%), liver (10%), pancreas (17%),
and spleen (6%); 66% of study participants had single- or multiorgan
system damage, and 25% of patients showed multiorgan damage with
varying degrees of overlap between various organs [[29]5].
Cardiac complications occur in 20–44% of inpatients and are an
independent risk factor for COVID-19-related death [[30]6]. Viral
invasion of cardiomyocytes [[31]7] or systemic inflammatory responses
without direct viral infiltration [[32]8] can cause myocarditis, which
can lead to heart failure and arrhythmias. Some patients with severe
COVID-19, including those who did not have underlying kidney problems
prior to the disease, acquire signs of kidney damage, with more than
30% of COVID-19 inpatients developing kidney damage [[33]9]. In
addition, multiple studies have reported the occurrence of liver damage
in COVID-19 patients, indicating that 2–11% of COVID-19 patients
develop liver comorbidities. Furthermore, in 16–53% of reported cases,
increases in alanine aminotransferase (ALT) and aspartate
aminotransferase (AST) levels occur during disease progression
[[34]10], which suggests that hepatocytes are damaged and that the
liver is inflamed.
An inflammatory cytokine storm is the most frequently reported
phenomenon in COVID-19. Inflammatory cytokines are immune responses
intended to kill pathogens; however, the hyperinflammatory state
associated with excessive production of cytokines can cause permanent
damage to cells and mitochondria and induce cell death, potentially
leading to further organ damage [[35]11]. Angiotensin-converting enzyme
2 (ACE2), a key enzyme of the renin-angiotensin-aldosterone system
(RAAS) that maintains homeostasis of blood pressure, electrolytes, and
the inflammatory response, is also a possible cause of COVID-19-related
damage to the lung [[36]12], heart [[37]13], liver [[38]14] and kidney
[[39]15] as SARS-CoV-2 enters cells through ACE2. Mitochondria are
another important target of SARS-CoV-2. Mitochondria, the main
production sites of adenosine triphosphate (ATP) [[40]16], are involved
in the regulation of cellular immunity, homeostasis, and cell survival
and death. There is evidence suggesting that SARS-CoV-2 hijacks the
mitochondria of immune cells, replicates within the mitochondrial
structure, and impairs mitochondrial dynamics, leading to cell death
[[41]17]. However, whether SARS-CoV-2 can impair organ function by
direct viral infection via ACE2, mitochondrial damage, or multiorgan
damage triggered by an inflammatory cytokine storm needs to be further
investigated.
SARS-CoV-2 causes an increase in mitochondrial DNA (mtDNA) levels
during infection that may trigger an excessive immune response and lead
to severe pathology in COVID-19, including multiorgan failure [[42]18].
While it is believed that mitochondrial antiviral signaling (MAVS)
interacts with different SARS-CoV-2 proteins, the SARS-CoV-2 M protein
inhibits MAVS protein aggregation, and the mitochondrial
membrane-anchored MAVS protein is a key factor in the cellular
antiviral defense system that further inhibits the innate antiviral
response [[43]19]. Immune evasion and hyperinflammation during COVID-19
can also be related to the disruption of mitochondrial quality
[[44]20]. In addition, patients with COVID-19 have reduced
mitochondrial oxidative phosphorylation (OXPHOS) and bioenergetics, and
COVID-19 is reportedly associated with inhibition of mitochondrial gene
transcription [[45]21]. SARS-CoV-2 infection hinders mitochondrial
bioenergetics, which in turn can trigger inflammasome activation.
Consequently, mitochondrial inhibition not only results in excessive
cytokine production but also exerts a substantial impact on organs that
heavily depend on mitochondrial energy production [[46]21].
Based on this evidence, we hypothesized that COVID-19 may cause damage
to the heart, kidney, and liver via mitochondrial dysfunction and
downstream responses in addition to direct viral infection and cytokine
storms. Systemic transcriptomic analysis was conducted to examine the
roles of mitochondria in COVID-19-related multiorgan damage ([47]Fig
1). Our analysis shows that it can be reasonably inferred that there is
a correlation between the mitochondrial damage caused by SARS-CoV-2
infection and further deterioration of heart, kidney, and liver
function, resulting in multiorgan damage. This study enables
understanding of the causes of multiorgan complications caused by
SARS-CoV-2 and the development of treatment regimens.
Fig 1. Summary flow chart.
[48]Fig 1
[49]Open in a new tab
The analysis in this study starts with data collection on gene
expression, followed by gene transcriptome analysis to obtain
significantly expressed genes and toxicity analysis to further identify
relevant significantly expressed genes. Subsequently, machine learning
was used with the significantly expressed genes identified through
statistical and biological methods to validate the hypothesis that
SARS-CoV-2 would cause further damage to cardiac, hepatic, and renal
function by infecting mitochondria. Finally, we conducted a literature
review on the common genes, feature importance, and pathway analysis of
these transcriptomes to investigate how SARS-CoV-2 infects mitochondria
and damages cardiac, liver, and kidney function and drew conclusions.
Created with BioRender.com.
Materials and methods
Bioinformatics and machine learning tools were used to analyze multiple
publicly available RNA-Seq sample sets from clinical samples.
NetworkAnalyst is a comprehensive gene expression profiling and web
visualization analysis [[50]22]; Ingenuity Pathway Analysis (IPA)
provides analysis and development tools for genomics, proteomics, drug
toxicology, and metabolic and regulatory pathway studies [[51]23];
DAVID (Database for Annotation, Visualization and Integrated Discovery)
is a web-based tool for functional evaluation of the gene expression
data [[52]24–[53]26]; ClueGO [[54]27] is a Cytoscape plug-in for
deciphering functionally grouped gene ontology and pathway annotation
networks [[55]28]; GSEA (Gene Set Enrichment Analysis) is applied to
assess the distribution trend of genes in a specific gene set arranged
in a gene table based on their correlation to the phenotype to
determine its contribution to the phenotype [[56]29, [57]30]. Python
packages for machine learning sklearn (1.2.2), imblearn (0.10.1), and
XGBoost (1.7.5) were run in Python 3.10.11. The cnetplot is a function
from the clusterProfiler R package, commonly used for visualizing gene
set enrichment analysis results. This function creates a category net
plot that combines a category enrichment plot with a gene network plot.
The version of clusterProfiler is 4.8.3 and runs in R 4.3.1; Venn
diagrams and feature importance analysis tools were used to analyze
specific genes to identify biological pathways and important molecules
in the heart, kidney, and liver that correlate with mitochondria
affected by SARS-CoV-2.
Data availability
The raw counts of the RNA-seq data were obtained from the NCBI Gene
Expression Omnibus (GEO) database. The selection of sample sets is
determined by several factors, including the sample size for machine
learning, the kind of human tissue being studied, and the presence of
sufficient differentially expressed genes to facilitate further
analysis. Prior to selecting these five sample sets, we conducted tests
on additional sample sets. Nevertheless, the majority of these datasets
exhibit a limited number of differentially expressed genes, typically
fewer than 100 or slightly exceeding this threshold. Hence, after the
evaluation based on the criteria, the selected sample sets comprised
[58]GSE152075, [59]GSE163151, [60]GSE157103, [61]GSE169241, and
[62]GSE152641 for significant differentially expressed genes and
toxicity analysis.
For the selection of sample sets of machine learning, our criteria were
machine learning performance, sample size, imbalanced data state, and
overfitting assessment. We also performed a comparison of
[63]GSE152075, [64]GSE163151, [65]GSE157103, and [66]GSE152641 for
machine learning performance and a complete analysis.
The sample sets [67]GSE152075 and [68]GSE163151 were obtained from
nasopharyngeal swabs, [69]GSE157103 from leukocytes, [70]GSE169241 from
human heart autopsy tissues, and [71]GSE152641 from whole blood
([72]Table 1). These five sample sets were first tested for differences
between tissues. We then performed machine learning on [73]GSE152075
with a larger sample size to validate the effect of COVID-19 on
mitochondria and to identify further effects on the heart, kidney, and
liver.
Table 1. Summary of RNA sequence sample sets of COVID-19 in Gene Expression
Omnibus.
GEO accession GSE title Tissue Platform Sample size
[74]GSE152075 In vivo antiviral host transcriptional response to
SARS-CoV-2 by viral load, sex, and age [[75]31] Nasopharyngeal swab
[76]GPL18573 Illumina NextSeq 500 (Homo sapiens) 484
(COVID: control = 430: 54)
[77]GSE152075 Summary
1. SARS-CoV-2 triggered an antiviral response driven by interferon and
concurrently decreased the transcription of ribosomal proteins.
Levels of B cells and neutrophils were higher in patients with a
lower viral load.
2. A decrease in the expression of chemokines CXCL9/10/11, cognate
receptors CXCR3, CD8A, and granzyme B was observed in elderly
individuals.
3. B- and NK-cell-specific transcripts were reduced and NF-κB
inhibitors increased in males.
[78]GSE163151 A diagnostic host response biosignature for COVID-19 from
RNA profiling of nasal swabs and blood [[79]32] Nasopharyngeal swab
[80]GPL24676 Illumina NovaSeq 6000 (Homo sapiens) 149
(COVID: control = 138: 11)
[81]GSE163151 Summary
1. A two-step classifier based on machine learning that was run on an
individual test set of NP swab samples was able to segregate
COVID-19 and non-COVID-19 and infectious or noninfectious acute
respiratory disease with 85.1%-86.5% accuracy.
2. SARS-CoV-2 infection has unique biologic features and differences
between NP swabs and whole blood that can be used in the
differential diagnosis of COVID-19.
[82]GSE169241 hPSC-derived cells to model macrophage-mediated
inflammation in COVID-19 Hearts [[83]33] Human heart autopsy tissues
[84]GPL24676 Illumina NovaSeq 6000 (Homo sapiens) 8
(COVID: control = 3:5)
[85]GSE169241 Summary
1. COVID-19 patients showed a significant increase in CCL2 expression
and macrophage infiltration in cardiac tissue.
2. After SARS-CoV-2 exposure, macrophages induced increased
cardiomyocyte reactive oxygen species and apoptosis via secretion
of IL-6 and TNF-α.
[86]GSE157103 Large-scale multiomic analysis of COVID-19 severity
[[87]34] Leukocyte [88]GPL24676 Illumina NovaSeq 6000 (Homo sapiens)
126
(COVID: control = 100: 26;
ICU: non-ICU = 50: 50)
[89]GSE157103 Summary
1. Coagulation-related proteins, as well as cellular fibronectin
(cFN), were significantly increased in COVID-19 patients.
2. The abundance of prothrombin, which is cleaved to form thrombin
during coagulation, was significantly reduced and correlated with
severity, which may help to elucidate the hypercoagulable
environment of SARS-CoV-2 infection.
[90]GSE152641 Transcriptomic similarities and differences in host
response between SARS-CoV-2 and other viral infection [[91]35] Whole
blood [92]GPL24676 Illumina NovaSeq 6000 (Homo sapiens) 86
(COVID: control = 62: 24)
[93]GSE152641 Summary
1. Changes in gene expression in the peripheral blood of COVID-19
patients correlated highly with changes in response to other viral
infections. However, two genes, ACO1 and ATL3, showed significantly
different changes in expression.
2. Some dynamic immune evasion and counter host responses are specific
to COVID-19. For example, CD56bright NK cells, M2 macrophages, and
total NK cells were increased in COVID-19.
[94]Open in a new tab
Processing of RNA sequencing data
NetworkAnalyst is utilized for gene expression analysis, and this study
required tuning of various parameters. To initiate gene expression
analysis using NetworkAnalyst, the mean is employed for gene-level
summarization. It simplifies and standardizes representation, reduces
random noise in the data, and makes it more amenable to downstream
analyses. Additionally, it helps mitigate issues with multiple testing
and operates under the assumption that each transcript contributes
equally to the gene’s activity.
Before performing differential expression analysis, filtering is
employed to enhance statistical power by eliminating genes that do not
exhibit a response. To obtain accurate and meaningful inferences from
differential expression analysis data, it is imperative to employ
appropriate normalization techniques.
Filtering aids in the elimination of information that is demonstrably
inaccurate or unlikely to be instructive. To modify the quantity of
genes excluded from subsequent analysis, it is necessary to set up the
parameters for variance and abundance filters. The variance filter
eliminates features with a variance percentile rank less than the
threshold of those with consistent expression values across
circumstances. In this instance, the variance filtering was configured
to a threshold of 15, in accordance with the default configuration of
NetworkAnalyst. Consequently, data in the lowest 15th percentile of
expression will be removed. The parameter for eliminating features with
counts below the set threshold is referred to as low abundance. In this
study, the default value of 4 was employed in NetworkAnalyst.
To accurately detect transcriptional differences and to guarantee that
the expression distributions of each sample are consistent throughout
the whole experiment, normalization is essential.
Log2-counts per million (Log2-CPM) is a normalization method commonly
used in RNA-seq data processing. Log transformation helps to compress
the range of data, ensuring that genes with high expression variability
do not disproportionately influence subsequent analyses.
The limma approach is commonly utilized in the context of differential
expression analysis because of its use of linear models, which
frequently leads to improved computational efficiency when compared to
alternative methods for differential expression analysis [[95]36]. The
adjusted p value was set to 0.05 [[96]37] and the log[2]-fold change to
1.5 [[97]38] to ensure that the differentially expressed genes being
identified were statistically significant and biologically meaningful
with sufficient variance. This makes the results more biologically
relevant and valuable for application. The volcano plots of the
significantly expressed genes in [98]GSE152075, [99]GSE169241,
[100]GSE157103, [101]GSE163151, and [102]GSE152641 show genes with
increased and decreased gene expression (S1 Fig in [103]S1 File)
[[104]39].
Gene set enrichment and pathway analysis
After obtaining the significantly expressed genes from NetworkAnalyst’s
differential gene expression analysis, we used Ingenuity Pathway
Analysis (IPA; version 84978992) for a comparative analysis of the five
sample sets and obtained toxicity lists for each sample set from the
Tox analysis.
DAVID, ClueGO, the cnetplot from the clusterProfiler package in R and
GSEA were used for pathway enrichment analysis and functional
annotation. The significantly expressed genes of these transcriptomes
were analyzed in terms of KEGG, REACTOME, and WIKIPATHWAYS in DAVID.
Genes are linked with enriched pathways with GO, KEGG, and REACTOME by
cnetplot. GSEA assesses how predefined gene sets are distributed within
a gene table sorted by their correlation with the phenotype, helping to
determine their impact on the phenotype. The gene set database utilized
in the analysis was h.all.v2023.1.Hs.symbols.gmt. A total of 1000
permutations were performed, with gene symbols being collapsed in the
database. The permutation type employed was phenotypic, and the chip
platform used was
Human_Gene_Symbol_with_Remapping_MSigDB.v2023.1.Hs.chip. ClueGO was
used for comprehensive pathway and biological analysis and functional
annotation of GO terms and pathways. The databases of GO, KEGG,
WIKIPATHWAYS, REACTOME Reactions, and REACTOME Pathways were included
in the ClueGO setting panel. The parameters encompass a p value
threshold established at 0.05, a GO hierarchy ranging from level 8 to
15, and a pathway selection criterion of 2 genes with 6% per pathway.
Other parameters were set to default. In addition, a Venn diagram was
used to identify common genes for analysis of the biological meaning in
the mitochondria, heart, kidney, and liver through their
transcriptomes.
Machine learning
The [105]GSE152075 sample set was selected for machine learning based
on several considerations. In numerous machine learning tasks, it is
generally observed that the performance of the model tends to improve
as the number of samples increases. This can be attributed to the fact
that the model is subjected to a larger volume of data, enabling it to
acquire a broader range of features and patterns. A greater number of
data typically leads to improved generalization capabilities of the
model.
When performing model training with a limited amount of data, there is
an increased likelihood of encountering the phenomenon known as
overfitting [[106]40]. Overfitting refers to a situation in which the
model demonstrates excellent performance when evaluated on the training
data but fails to generalize effectively when presented with fresh,
previously unseen data. This implies that the model could exhibit
excessive complexity and has inflexibly acquired the characteristics of
the training data, resulting in poor performance when applied to novel
data.
The utilization of diverse sample sets may yield disparate outcomes,
introducing complexity to the analysis. The [107]GSE169241 sample set
derives from human heart autopsy tissues, [108]GSE157103 from
leukocytes, and [109]GSE152641 from whole blood samples. Similar to
[110]GSE152075, [111]GSE163151 utilized nasopharyngeal swabs as the
source of samples. However, due to the severe imbalance of samples
between the COVID and control groups (138:11) in [112]GSE163151, even
employing techniques such as SMOTE to address the issue of imbalanced
data might not produce adequate outcomes.
We use several indexes, including F1-Score, MCC, and AUC, to evaluate
the performance of models, each with its specific formula,
significance, and application, particularly in scenarios with
imbalanced data.
The metric of F1-Score is the harmonic mean of precision and recall,
calculated as
[MATH: F1-Score=2×Precision×RecallPrecision+Recall
:MATH]
A higher F1-Score, closer to 1, indicates better model performance,
balancing the precision and recall, which is especially crucial in
imbalanced datasets. An F1-Score near 0 indicates poor model
performance.
MCC (Matthews Correlation Coefficient) is calculated using the formula:
[MATH:
MCC=TP×TN−FP×FNTP+FPTP+FNTN+FP
mrow>TN+F<
mi>N :MATH]
Where TP, TN, FP, and FN represent true positives, true negatives,
false positives, and false negatives, respectively. MCC values range
from -1 to +1. MCC is particularly valuable in imbalanced datasets as
it provides a balanced measure even when class distribution is skewed.
AUC (Area Under the Curve) refers to the area under the ROC (Receiver
Operating Characteristic) curve, AUC ranges from 0 to 1. AUC is less
sensitive to class imbalance, making it a robust measure for evaluating
models on imbalanced datasets.
In imbalanced data scenarios, these metrics are crucial as they offer
more comprehensive insights into a model’s performance than mere
accuracy. While a high accuracy might be misleading in such cases, a
high score in F1, MCC, and AUC indicates that the model effectively
handles both minority and majority classes, providing a holistic
assessment of its predictive capabilities across different class
distributions.
The machine learning comparison was conducted on four sample sets,
[113]GSE152075, [114]GSE163151, [115]GSE157103, and GSE1526414, with
the top 40 significantly expressed genes (S1 Table in [116]S1 File),
which is explained for the number of gene selection in the subsection
of sensitivity analysis of machine learning with varying significant
gene counts. The [117]GSE169241 sample set was excluded from the
comparison due to its small sample size. Based on the factors mentioned
above, the overall performance of [118]GSE152075 generally showed the
most optimized selection regarding machine learning performance, sample
size, imbalanced data state, and overfitting assessment (S2 Table in
[119]S1 File).
In addition, utilizing several sample sets can offer a more
comprehensive outlook; however, performing an in-depth examination of a
singular sample set facilitates a more intricate study and
comprehension. Therefore, we mainly utilized the [120]GSE152075 sample
set, which consists of 484 samples driven by the objective of ensuring
consistency and uniformity in the subsequent study.
With the selected genes from the [121]GSE152075 sample set as features,
machine learning methods were used to test whether the association
between COVID-19 and effects in the mitochondria, heart, kidney, and
liver could be predicted. Through these machine learning results, the
pathways and biological meanings of the genes were further analyzed.
We employed four machine learning algorithms, including XGBoost
[[122]41] with parameters for colsample_bytree = 0.9, learning_rate =
0.1, max_depth = 10, n_estimators = 50, random forest [[123]42] with
parameters for max_depth = 10, min_samples_split = 5, n_estimators =
100, logistic regression [[124]43] with parameters for C = 50, max_iter
= 5000 and SVM [[125]44] with parameters for kernel = ’rbf’, C = 100,
gamma = 0.01, probability = True, to test and predict COVID-19. The
K-fold cross-validation technique was utilized, with K-fold sets to 10.
Since the ratio of the experimental group to the control group in the
[126]GSE152075 sample set was 430:54, the control group size was
insufficient, which created a problem of imbalanced data. Thus, we
employed SMOTE (Synthetic Minority Oversampling Technique) to analyze
the samples in the minority category and add new samples to the sample
set. SMOTE is a method used to address imbalanced sample sets,
especially for oversampling minority classes. In imbalanced sample
sets, the number of samples from one class greatly outnumbers the
other. This results in many machine learning models being biased toward
the majority class as they try to maximize overall accuracy,
potentially neglecting or misclassifying the minority class. SMOTE
addresses the issue of class imbalance not by simply duplicating
samples from the minority class but by generating new synthetic
samples. For every sample in the minority class, it identifies
k-nearest neighbors, all of which belong to the same minority class. A
neighbor is then randomly chosen, and a synthetic sample is produced at
a random point between this sample and its selected neighbor. This
procedure is iteratively carried out until a desired sample count or
proportion for the minority class is reached. The synthesis method of
SMOTE can be described using the following mathematical formula:
Given a sample x[i] from the minority class, a random selection is made
from its k nearest neighbors, denoted as x[zi]. Next, a random number λ
between 0 and 1 is chosen. The new synthetic sample x[new] can be
generated using the formula:
[MATH:
xnew=<
msub>xi+λ×(xzi−xi) :MATH]
This method ensures that the new synthetic sample lies somewhere on the
line segment between the original sample and its chosen neighbor. Each
iteration might yield different results because λ is randomized.
SMOTE offers notable advantages in tackling data imbalance. Instead of
duplicating minority class samples, it produces synthetic samples,
enhancing the sample set’s diversity and reducing the risk of
overfitting. Furthermore, by expanding the minority class data, SMOTE
ensures that models can better grasp the nuances of this class, leading
to improved prediction accuracy [[127]45]. Using this method, the ratio
of the experimental group to the control group was 1:1.
SHAP (SHapley Additive exPlanations, 0.41.0) was used to analyze the
prediction interpretation of the contribution of each feature. We then
calculated the Shapley value of each feature to measure the
contribution of the feature to the prediction so that the contribution
of each feature could be understood in detail [[128]46].
The codes are available at
[129]https://github.com/ntumitolab/ML-RNA-Seq.
Results
After analyzing and processing the RNA sequencing raw count data for
sample sets [130]GSE152075, [131]GSE163151, [132]GSE169241,
[133]GSE157103 and [134]GSE152641, the significant genes of each sample
set were obtained and tested in machine learning, pathway analysis and
IPA-Tox analysis.
Sensitivity analysis of machine learning with varying significant gene counts
The initial investigation involved assessing the predictive
capabilities of the features identified by the statistical methodology
for COVID-19 to gain insights into the potential utility of machine
learning. Consequently, we ranked the significantly expressed genes
from the differential gene expression analysis of [135]GSE152075 from
NetworkAnalyst by their adjusted p value. The top 100 significantly
expressed genes from the differential expression analysis of
[136]GSE152075 (S3 Table in [137]S1 File) were ordered by the adjusted
p value and used for machine learning and pathway analysis. The machine
learning prediction power for predicting COVID-19 was tested on the top
100, 80, 60, 50, 40, 30, and 20 significantly expressed genes, and the
results showed that the accuracy, F1-Score, and AUC of the four machine
learning algorithms except SVM were near or above 90% for the top
30–100 significantly expressed genes ([138]Table 2). This implies 30
significantly expressed genes to be sufficient for good prediction
power. However, the overall performances of 40 significantly expressed
genes in these four machine learning algorithms were better than 30
significantly expressed genes. Therefore, we proceeded with 40 genes
for a more comprehensive pathway enrichment exploration.
Table 2. Machine learning results of the top 100, 80, 60, 50, 40, 30, 20
significantly expressed genes, and randomly selected 40 genes in
[139]GSE152075.
Features 100 genes 80 genes 60 genes 50 genes 40 genes 30 genes 20
genes Randomly selected 40 genes
Machine Learning
XGBoost Accuracy 0.963 0.963 0.965 0.963 0.969 0.961 0.950 0.851
Sensitivity 0.979 0.984 0.984 0.979 0.984 0.977 0.967 0.921
Specificity 0.833 0.796 0.815 0.833 0.852 0.833 0.815 0.296
Precision 0.979 0.975 0.977 0.979 0.981 0.979 0.977 0.912
F1-Score 0.979 0.979 0.980 0.979 0.983 0.978 0.972 0.917
MCC 0.812 0.807 0.819 0.812 0.842 0.804 0.758 0.225
AUC 0.973 0.972 0.978 0.974 0.972 0.970 0.943 0.657
Random Forest Accuracy 0.965 0.969 0.969 0.965 0.969 0.965 0.952 0.849
Sensitivity 0.988 0.991 0.991 0.988 0.991 0.988 0.984 0.919
Specificity 0.778 0.796 0.796 0.778 0.796 0.778 0.704 0.296
Precision 0.973 0.975 0.975 0.973 0.975 0.973 0.964 0.912
F1-Score 0.980 0.983 0.983 0.980 0.983 0.980 0.974 0.915
MCC 0.815 0.837 0.837 0.815 0.837 0.815 0.745 0.220
AUC 0.975 0.970 0.963 0.964 0.961 0.964 0.956 0.646
Logistic Regression Accuracy 0.938 0.924 0.913 0.909 0.899 0.897 0.798
0.698
Sensitivity 0.944 0.935 0.916 0.909 0.902 0.898 0.781 0.726
Specificity 0.889 0.833 0.889 0.907 0.870 0.889 0.926 0.481
Precision 0.985 0.978 0.985 0.987 0.982 0.985 0.988 0.918
F1-Score 0.964 0.956 0.949 0.947 0.941 0.939 0.873 0.810
MCC 0.737 0.676 0.669 0.667 0.628 0.631 0.487 0.143
AUC 0.976 0.974 0.973 0.974 0.976 0.962 0.943 0.691
SVM Accuracy 0.880 0.851 0.829 0.820 0.800 0.725 0.643 0.446
Sensitivity 0.867 0.847 0.821 0.807 0.786 0.700 0.607 0.426
Specificity 0.981 0.889 0.889 0.926 0.907 0.926 0.926 0.611
Precision 0.997 0.984 0.983 0.989 0.985 0.987 0.985 0.897
F1-Score 0.928 0.910 0.895 0.889 0.875 0.819 0.751 0.577
MCC 0.638 0.546 0.511 0.517 0.480 0.408 0.337 0.023
AUC 0.975 0.969 0.968 0.967 0.962 0.957 0.962 0.654
[140]Open in a new tab
A randomly selected 40 genes from all genes of [141]GSE152075 as
machine learning features were also tested as a baseline to validate
the result of assessing the predictive capabilities of the features
identified by the statistical methodology for COVID-19. The result
showed that the machine learning predictive capabilities cannot be
established without pre-process feature selection.
The top 40 significantly expressed genes in the [142]GSE152075 sample
set ([143]Fig 2A) were then analyzed in DAVID for KEGG, REACTOME, and
WIKIPATHWAYS pathway analysis. COVID-19- and SARS-CoV-2-related
pathways were identified as some of the major pathways from these top
40 significantly expressed genes. In addition, pathways related to
inflammatory cytokine storms, such as interferon signaling [[144]47],
interferon alpha/beta/gamma signaling, and cytokine signaling in the
immune system, were identified ([145]Fig 2B), accounting for a large
proportion of the pathway analysis results in ClueGo ([146]Fig 2C).
Genes linked with enriched pathways by cnetplot shows GO enrichment
analysis for the top 40 DEGs in Biological Process (BP) is mainly
virus-related pathway ([147]Fig 2D), KEGG is coronavirus
disease—COVID-19 ([148]Fig 2E), and REACTOME is interferon-related
signaling ([149]Fig 2F). Interferon alpha/gamma signaling can also be
seen in GSEA ([150]Fig 2G). Hence, it can be inferred that inflammatory
cytokine storms are strongly associated with COVID-19. Using these top
40 significantly expressed genes as machine learning features may
effectively predict COVID-19.
Fig 2. Pathway analysis of the top 40 significantly expressed genes in
[151]GSE152075.
[152]Fig 2
[153]Open in a new tab
(A) List of the top 40 significantly expressed genes in the
[154]GSE152075 sample set. (B) DAVID analysis results of the top 40
significantly expressed genes in the REACTOME pathways. (C) Network of
pathways from ClueGO for the top 40 significantly expressed genes
showing that SARS-CoV-2 and inflammatory cytokine storm-related
interferon signaling pathways are the main terms by group among the top
40 significantly expressed genes. Cnetplot of (D) GO enrichment
analysis for the top 40 DEGs in Biological Process, (E) KEGG, and (F)
REACTOME. (G) GSEA results for enrichment analysis of the top 40
significantly expressed genes.
Tox analysis of mitochondria, heart, kidney, and liver
The toxicity lists were obtained by performing Tox analysis on the
significantly expressed genes in the five sample sets ([155]Table 1)
via IPA. The samples from different tissues and sampling platforms
showed different common genes ([156]Fig 3) and toxicity lists,
indicating that the pathogenesis of COVID-19 is tissue specific.
Fig 3. The common genes across the RNA-seq datasets.
[157]Fig 3
[158]Open in a new tab
Top 100 significantly expressed genes were compared between RNA-seq
datasets [159]GSE152075, [160]GSE163151, [161]GSE169241, [162]GSE157103
and [163]GSE152641. The Venn diagram result indicates that the
pathogenesis of COVID-19 is tissue specific.
The nasopharyngeal swab data from [164]GSE152075 and [165]GSE163151
each showed the largest number of genes in common between these five
sample sets and the toxic effects of COVID-19 on the heart, kidney,
liver, and mitochondria ([166]Fig 4A and 4B). The human heart tissue
samples from [167]GSE169241 showed the toxic effects of COVID-19 mainly
on the heart and mitochondria ([168]Fig 4C). While the data on
leukocytes from [169]GSE157103 and on whole-blood samples from
[170]GSE152641 showed that COVID-19 had toxic effects on the heart,
kidney, and liver, the toxic effects on mitochondria were not
significant ([171]Fig 4D and 4E). However, the toxicity lists obtained
from the [172]GSE157103 leukocyte samples with ICU and non-ICU toxicity
analysis showed a toxic mitochondrial effect in addition to the toxic
cardiac, renal, and hepatic effects ([173]Fig 4F), which suggests that
mitochondrial dysfunction is associated with disease progression in
COVID-19 patients [[174]48].
Fig 4. IPA comparative analysis of the RNA-seq data.
[175]Fig 4
[176]Open in a new tab
Differentially expressed genes between the COVID-19 patients and the
control groups were compared in the comparative IPA analysis for
toxicity lists and toxicity functions. The following RNA-seq sample
sets were obtained from the NCBI Gene Expression Omnibus database
(GEO): [177]GSE152075 (A, nasopharyngeal swab), [178]GSE163151 (B,
nasopharyngeal swab), [179]GSE169241 (C, human heart autopsy tissues),
[180]GSE157103 (D, leukocyte), and [181]GSE152641 (E, whole blood). ICU
patients and non-ICU patients were compared in the [182]GSE157103
sample set (F).
Machine learning of genes associated with the mitochondria-, heart-, kidney-,
and liver-related toxicity list
In IPA-tox analysis of the [183]GSE152075 sample set, we identified the
differentially expressed genes (DEGs) associated with the
mitochondria-related toxicity list ([184]Table 3) under the criteria of
-log(p value) > 1.3 for further machine learning analysis.
Mitochondrial dysfunction was the relevant pathway identified by the
mitochondria-related toxicity list, which included 32 DEGs in
[185]GSE150275. Similarly, significantly expressed genes were
identified from the heart-, kidney-, and liver-related toxicity lists
([186]Table 3). Among them, the heart-related toxicity lists included
cardiac fibrosis [[187]49] and cardiac necrosis/cell death pathways
[[188]50]; there were 38 genes in these toxicity lists. The
kidney-related toxicity list included renal necrosis/cell death
[[189]51], increases renal nephritis [[190]52] panel (PSTC) [[191]53],
and increases glomerular injury pathways [[192]54]; there were 55 genes
in these toxicity lists. The liver-related toxicity lists included
increases in liver hepatitis [[193]55], liver necrosis/cell death
[[194]56] and increases in liver damage pathways [[195]57]; there were
42 genes in these toxicity lists ([196]Table 3). These identified genes
from the toxicity lists were compared to the top 100 and top 40
significantly expressed genes, and only a few common significantly
expressed genes were identified, including NDUFV1 and PRDX5 in
mitochondrial dysfunction and RRAD, CIB1, CYBB, and SLC8A1 in the
heart. There were many differences in gene expression between those
identified by the statistical meaning method and those identified by
the biological meaning method.
Table 3. Genes associated with mitochondria-, heart-, kidney-, and
liver-related toxicity list and top significantly expressed genes in
[197]GSE152075.
Genes associated with mitochondrial-related toxicity list (32 genes)
Genes associated with heart-related toxicity list (38 genes) Genes
associated with renal-related toxicity list (55 genes) Genes associated
with liver-related toxicity list (42 genes)
Related toxicity list Mitochondrial Dysfunction Cardiac Fibrosis
Cardiac Necrosis/Cell Death Renal Necrosis/Cell Death
Increases Renal Nephritis
Renal Safety Biomarker Panel (PSTC)
Increases Glomerular Injury Increases Liver Hepatitis
Liver Necrosis/Cell Death
Increases Liver Damage
Genes in toxicity list ACO2, ATP5F1E, ATP5ME, BAD, CLIC2, COX4I1,
COX5A, COX5B, COX6A1, COX7B, CYC1, FIS1, GPX1, GPX4, GSTP1, MT-ND6,
NDUFA11, NDUFA13, NDUFA2, NDUFA4, NDUFAB1, NDUFB10, NDUFB2, NDUFB7,
NDUFS6, NDUFV1, PRDX5, SURF1, TOMM7, UQCR11, UQCRC1, UQCRQ CYBB, TLR4,
PRKCB, SLC8A1, TLR2, GPX1, ACE2, DVL1, SLC2A1, STEAP3, CIB1, NDUFS6,
MB, CARD6, JUN, LMNA, KLF15, LARP6, DAG1, USP18, NEXN, FLT1, PIN1,
NUB1, CTNNB1, PROX1, S100A6, CCN1, RRAD, CASP1, BAD, NDUFA13, SOCS3,
KLF4, NTN1, SERPINF1, JUND, ABCC9 MIF, TLR2, CLU, CTNNB1, FLT1, TLR7,
TNFSF13B, TNFSF14, OLR1, LRP5, CYBB, GPX4, ZEB1, TLR4, P2RX7, KMO,
ACE2, SLC2A1, FOS, LAMB2, PRKCB, MEFV, APRT, HSPA1A, HSPA1B, NDUFAB1,
CASP1, STUB1, BAD, ALDH3B1, MLKL, BCL2L14, RFXANK, IDO1, SLC8A1, PRDX2,
IER3, GNB2, BIRC3, CITED2, NTN1, GSTP1, ERBB2, ZBP1, APOBEC3A, AIM2,
FCGR3A, FCGR3B, CX3CR1, TFF3, DDR1, CD274, JUN, C3AR1, CCR1 TLR4, GBP5,
CASP1, CCL4, TLR2, TLR7, JUN, TNFSF14, ALDH3A1, P2RX7, KMO, MIF, BSG,
CCR5, IL2RG, ATG4B, AIM2, CCN1, EPHA2, CCL2, KEAP1, CHCHD2, PROS1,
SELL, KRT8, CTNNB1, FOS, CXCL10, IRF8, CD274, TKT, BAD, FGL2, GADD45B,
SIGIRR, SOCS3, IER3, BIRC3, USP18, PTPRC, JUND, PHB2
Common genes with top 100 significantly expressed genes NDUFV1, PRDX5
RRAD, CIB1, CYBB, SLC8A1 TNFSF13B, CCR1, CYBB, SLC8A1 ALDH3A1, CXCL10,
GBP5
Common genes with top 40 significantly expressed genes PRDX5 RRAD _
CXCL10
[198]Open in a new tab
Next, these DEGs in the mitochondria-, heart-, renal-, and
liver-related toxicity lists were used as features to test the
prediction efficiency of COVID-19 in machine learning models. The
results showed that the gene sets in the mitochondria-, heart-, renal-,
and liver-related toxicity list could mostly provide accuracy,
F1-Score, and AUC of the four machine learning algorithms except SVM,
which were near or above 90% ([199]Table 4). These machine learning
results demonstrated that the toxicity analyses of the mitochondria,
heart, liver, and kidney correlated with COVID-19 and were sufficient
for machine learning to predict COVID-19.
Table 4. Machine learning results of genes associated with the mitochondria-,
heart-, kidney-, and liver-related toxicity list in [200]GSE152075.
Feature Genes selected from Mitochondrial dysfunction toxicity list (32
genes) Genes selected from Heart toxicity list (Cardiac Fibrosis +
Cardiac Necrosis/Cell Death) (38 genes) Genes selected from Renal
toxicity list (Renal Necrosis/Cell Death + Increases Renal Nephritis +
Renal Safety Biomarker Panel (PSTC) + Increases Glomerular Injury) (55
genes) Genes selected from Liver toxicity list (Increases Liver
Hepatitis + Liver Necrosis/Cell Death + Increases Liver Damage) (42
genes)
Machine Learning
XGBoost Accuracy 0.948 0.938 0.938 0.942
Sensitivity 0.981 0.970 0.974 0.967
Specificity 0.685 0.685 0.648 0.741
Precision 0.961 0.961 0.957 0.967
F1-Score 0.971 0.965 0.965 0.967
MCC 0.723 0.678 0.668 0.708
AUC 0.859 0.913 0.913 0.933
Random Forest Accuracy 0.944 0.952 0.940 0.944
Sensitivity 0.991 0.993 0.986 0.981
Specificity 0.574 0.630 0.574 0.648
Precision 0.949 0.955 0.949 0.957
F1-Score 0.969 0.974 0.967 0.969
MCC 0.687 0.738 0.664 0.697
AUC 0.922 0.955 0.960 0.969
Logistic Regression Accuracy 0.905 0.897 0.907 0.909
Sensitivity 0.930 0.907 0.912 0.914
Specificity 0.704 0.815 0.870 0.870
Precision 0.962 0.975 0.982 0.982
F1-Score 0.946 0.940 0.946 0.947
MCC 0.574 0.600 0.647 0.652
AUC 0.863 0.898 0.919 0.939
SVM Accuracy 0.936 0.899 0.771 0.723
Sensitivity 0.979 0.919 0.756 0.693
Specificity 0.593 0.741 0.889 0.963
Precision 0.95 0.966 0.982 0.993
F1-Score 0.964 0.942 0.854 0.816
MCC 0.646 0.574 0.437 0.425
AUC 0.935 0.949 0.937 0.966
[201]Open in a new tab
To determine whether COVID-19 may further impair cardiac, hepatic, and
renal function due to mitochondrial dysfunction, we added the 32
significantly expressed genes from the mitochondria-related toxicity
lists to the significantly expressed genes of heart-, liver-, and
kidney-related toxicity lists. After combining the lists, there were 66
significantly expressed genes related to the mitochondria and heart, 83
significantly expressed genes related to the mitochondria and kidney,
and 73 significantly expressed genes related to the mitochondria and
liver. The results of this round of machine learning indicated that the
accuracy, F1 score, and AUC of all the machine learning algorithms,
including SVM, using the three sets of transcriptomes were above 90%
([202]Table 5). We found that adding the significantly expressed genes
in the mitochondria-related toxicity list to those of the heart-,
liver-, and kidney-related toxicity lists improved the prediction
powers in machine learning models. In particular, the accuracy of the
machine learning algorithm SVM of significantly expressed genes in the
heart-, kidney-, and liver-related toxicity lists all improved notably
(S2 Fig in [203]S1 File). Thus, we can infer that COVID-19 may cause
further damage to cardiac, liver, and kidney function by damaging
mitochondria.
Table 5. Machine learning results of genes associated with mitochondria and
other organ-related toxicity lists.
Features Genes selected from Mitochondrial dysfunction + Heart (Cardiac
Fibrosis + Cardiac Necrosis/Cell Death) toxicity list in [204]GSE152075
(66 genes) Genes selected from Mitochondrial dysfunction + Renal (Renal
Necrosis/Cell Death + Increases Renal Nephritis + Renal Safety
Biomarker Panel (PSTC) + Increases Glomerular Injury) toxicity list in
[205]GSE152075 (83 genes) Genes selected from Mitochondrial dysfunction
+ Liver (Increases Liver Hepatitis + Liver Necrosis/Cell Death +
Increases Liver Damage) toxicity list in [206]GSE152075 (73 genes)
Machine Learning
XGBoost Accuracy 0.955 0.946 0.934
Sensitivity 0.986 0.984 0.965
Specificity 0.704 0.648 0.685
Precision 0.964 0.957 0.961
F1-Score 0.975 0.970 0.963
MCC 0.755 0.707 0.661
AUC 0.905 0.912 0.952
Random Forest Accuracy 0.957 0.957 0.963
Sensitivity 0.991 0.993 0.998
Specificity 0.685 0.667 0.685
Precision 0.962 0.960 0.962
F1-Score 0.976 0.976 0.979
MCC 0.764 0.763 0.799
AUC 0.957 0.964 0.970
Logistic Regression Accuracy 0.919 0.940 0.940
Sensitivity 0.933 0.953 0.951
Specificity 0.815 0.833 0.852
Precision 0.976 0.979 0.981
F1-Score 0.954 0.966 0.966
MCC 0.657 0.727 0.732
AUC 0.906 0.922 0.953
SVM Accuracy 0.917 0.911 0.913
Sensitivity 0.935 0.926 0.926
Specificity 0.778 0.796 0.815
Precision 0.971 0.973 0.975
F1-Score 0.953 0.949 0.950
MCC 0.638 0.628 0.641
AUC 0.952 0.936 0.969
[207]Open in a new tab
Common gene analysis of the genes associated with mitochondria-, heart-,
kidney-, and liver-related toxicity
Finally, we identified the common genes among the significantly
expressed genes from the mitochondria-related toxicity list and the
three sets of significantly expressed genes from the heart-, kidney-,
and liver-related toxicity lists for further analysis. The common genes
in the mitochondria- and heart-related toxicity lists were NDUFS6,
NDUFA13, GPX1, and BAD; the common genes in the mitochondria- and
kidney-related toxicity lists were GPX4, GSTP1, NDUFAB1, and BAD. BAD
(BCL2-associated agonist of cell death) was the only common gene
between the mitochondria-related toxicity list and heart-, kidney-, and
liver-related toxicity lists ([208]Fig 5). The BAD protein, belonging
to the Bcl-2 gene family, is a proapoptotic member involved in
initiating apoptosis, which might explain the cell death in various
tissue types and contribute to further pathogenicity and organ damage
[[209]58].
Fig 5. Common genes associated with mitochondria-, heart-, kidney-, and
liver-related toxicity lists.
[210]Fig 5
[211]Open in a new tab
Regarding the common genes between the mitochondria-related toxicity
lists and heart- and kidney-related toxicity lists, GPX1 and GPX4 were
associated with oxidative stress; NDUFAB1, NDUFS6, and NDUFA13 were
associated with OXPHOS; and GSTP1 was associated with the NRF2-mediated
oxidative stress response. In addition to mitochondrial dysfunction,
oxidative stress and the NRF2-mediated oxidative stress response were
included in the [212]GSE152075 toxicity lists. OXPHOS generates
reactive oxygen species (ROS) [[213]59]. The presence of excess ROS
during the regulation of intracellular signaling may cause irreversible
damage to cellular components and trigger apoptosis through the
mitochondrial intrinsic apoptotic pathway [[214]60]. Therefore,
oxidative stress can cause apoptosis via a mitochondria-dependent
pathway [[215]61].
SHAP was used to determine the feature importance of significantly
expressed genes in the mitochondria-, heart-, kidney-, and
liver-related toxicity lists to analyze which genes have the greatest
impact. The top ranking feature importance of the significantly
expressed genes CXCL10, ATP5F1E, and ACE2 ([216]Fig 6A) identified
using the biological meaning method in the mitochondria-, heart-,
kidney-, and liver-related toxicity lists. CXCL10 is associated with
cytokine storms [[217]62] and is an important chemokine [[218]63].
Furthermore, it is reported to be an exceptional prognostic biomarker
for COVID-19 patients [[219]64, [220]65]. ACE2 is an entry receptor for
SARS-CoV-2 and is also associated with mtDNA depletion and
mitochondrial dysfunction [[221]66]. ATP5F1E encodes a subunit of
mitochondrial ATP synthase. A significant increase in expression of
ATP5F1E in COVID-19 patients has been reported [[222]67], which might
be related to elevated production of ROS and increased inflammation
[[223]68, [224]69]. The network of pathways analysis from ClueGO
involved the selection of the top feature importance of significantly
expressed genes with SHAP values higher than the average. These genes
include CXCL10, RRAD, USP18, ATP5F1E, CIB1, C3AR1, CYBB, PROS1, ACE2,
STUB1, UQCRQ, MT-ND6, SLC8A1, NDUFV1, COX5A, FLT1, NDUFA13, NDUFB7,
BAD, ATP5ME, NDUFAB1, LAMB2, SOCS3, PHB2, TFF3, and KLF15. The analyzed
pathways show that the genes from mitochondria-, heart-, kidney-, and
liver-related toxicity lists are closely related to COVID-19 and
mitochondria, including the mitochondrial immune response to
SARS-CoV-2, COVID-19 adverse outcome pathway, oxidative
phosphorylation, type II interferon signaling, regulation of IFNA/IFNB
signaling, and electron transport chain:OXPHOS system in mitochondria
([225]Fig 6B). Genes linked with enriched pathways by cnetplot shows GO
enrichment analysis for top feature importance of significantly
expressed genes with SHAP values higher than the average in Biological
Process (BP) is oxidative phosphorylation and ATP-related pathway
([226]Fig 6C), KEGG is also oxidative phosphorylation ([227]Fig 6D),
and REACTOME is ATP-related pathway ([228]Fig 6E). Thus, the
top-ranking feature importance significantly expressed genes from
mitochondria-, heart-, kidney-, and liver-related toxicity lists were
identified by a biological meaning approach for machine learning
analysis, and the main determinants included factors related to
COVID-19, immune response and mitochondria. Several interferon
signaling genes, such as IFIT1, IFIT2, IFIT3, IFI5 and CXCL10, which
are also associated with inflammatory cytokine storms and immune
responses, were among the top 40 significantly expressed genes.
Therefore, if the top significantly expressed genes were identified by
a statistical meaning approach for machine learning analysis, the main
determinants included factors also related to the inflammatory cytokine
storm and immune response.
Fig 6. Feature importance analysis.
[229]Fig 6
[230]Open in a new tab
(A) Feature importance is based on SHAP values of the genes from the
mitochondria-, heart-, kidney-, and liver-related toxicity lists. (B)
ClueGO network analysis of the top features with SHAP values higher
than the average. SHAP was also used to identify feature importance
genes in XGBoost to further elucidate the heart, kidney, and liver
damage caused by mitochondrial infection triggered by COVID-19.
Cnetplot of (C) GO enrichment analysis for the top features with SHAP
values higher than the average in Biological Process, (D) KEGG, and (E)
REACTOME.
Discussion
Machine learning is often applied for prediction or classification.
Here, we employ machine learning in a reverse sense: we first
formulated a hypothesis and then used the data that met the hypothesis
to perform machine learning. If the accuracy of the prediction was
high, the hypothesis had a high probability of being correct in terms
of logical inference, which would provide an interpretation for the
results obtained through machine learning.
The main objective of this study was to investigate the effects of
COVID-19 on mitochondrial infection and subsequent damage to the heart,
kidneys, and liver. To establish the correlation, two aims were
hypothesized and then tested to determine the hypothetical validity.
One aim used the statistical meaning approach and generated the
prediction for machine learning, which would then be used as a
baseline. This approach also allowed us to understand the reliability
and rationality of machine learning when used on these sample sets. The
other aim used the biological meaning of COVID-19 to analyze and
elucidate the main hypothesis of our study, which regarded the
infection of mitochondria by COVID-19 and subsequent cardiac, renal,
and liver damage. Therefore, this study employed machine learning to
evaluate our hypotheses.
The first aim used a statistical meaning approach to find relevant
genes to be used as machine learning features. The machine learning
results showed that by selecting the top significantly expressed genes
for machine learning, it was possible to predict COVID-19 effectively,
but this approach did not directly show the effect of COVID-19 on
mitochondria, which might be because other factors have more direct
effects. For example, inflammatory cytokine storms constituted a
significant proportion in the DAVID analysis of the top 40
significantly expressed genes. Inflammatory cytokine storms have a
strong connection with COVID-19, which causes multiorgan damage
[[231]70]. Machine learning can effectively predict COVID-19 using the
top 40 significantly expressed genes as features.
The second aim tested the machine learning features selected from
biological meaning, since the importance of individual genes in
biological pathways and biological meaning do not entirely reflect the
level of gene expression (i.e., few genes from the selected toxicity
lists were shown in the top 100 significantly expressed genes). We used
[232]GSE152075 toxicity analysis to identify significantly expressed
genes in the mitochondria-, heart-, kidney-, and liver-related toxicity
lists. We then used machine learning to validate whether the
significantly expressed genes identified by the biological meaning
approach are able to predict COVID-19 and whether the infection of
mitochondria by COVID-19 might have further biological meaning for
potential cardiac, kidney, and liver damage. The results of the final
analysis implied a correlation between the impact of COVID-19 on
mitochondria and further cardiac, renal, and hepatic impairment.
Therefore, we concluded that the effect of COVID-19 on mitochondria is
associated with the potential impairment of cardiac, hepatic, and renal
functions.
Machine learning has been widely applied to the diagnosis of COVID-19
patients. For example, the [233]GSE152075 sample set has been used in
other machine learning studies that used automated ML (AutoML)
[[234]71] and XGBoost for feature selection [[235]72]. Among the 24
selected feature genes (IGFBP2, KRT8, RPLP0, XAF1, RPL13, OAS2, CES1,
RPL4, EEF1G, NR2F6, RPS8, RPL10A, SNX14, C5orf15, TNFRSF19, CD24,
ALAS1, CEP112, C9orf24, POLR2J3, AAMP, DUOX2, EMCN, and RPL3), keratin,
type II cytoskeletal 8 (KRT8) was the only common gene out of the
significantly expressed genes in the mitochondria-, heart-, kidney-,
and liver-related toxicity lists obtained from our analysis of
[236]GSE152075 gene expression data. Maleknia et al. [[237]73] employed
the least absolute shrinkage and selection operator (LASSO) regression
model to perform feature selection, and nasopharyngeal swab sample sets
from [238]GSE163151, [239]GSE152075, [240]GSE156063, and [241]GSE188678
were applied. Random forest classification was used for training
prediction. The common genes shared between those selected via feature
selection using this LASSO regression model (COPA, CXCL11, IFI6, MIF,
NUCB1, SAMHD1, SIGLEC1, and TMED9) and the top 100 significantly
expressed genes of [242]GSE152075 were interferon alpha inducible
protein 6 (IFI6), C-X-C motif chemokine ligand 11 (CXCL11), and sialic
acid binding Ig-like lectin 1 (SIGLEC1), all of which are related to
the immune response. There was one gene in common when screening
against all the significantly expressed genes in the [243]GSE152075
mitochondria-, heart-, kidney-, and liver-related toxicity lists, which
was macrophage migration inhibitory factor (MIF). Since different
feature selection methods yield different genes, the biological
pathways or biological meaning they represent can vary.
In our study, significantly expressed genes obtained from the
statistical meaning approach and from the biological meaning approach
had few genes in common. However, both machine learning results were
predictive of COVID-19, and only the interpretation they presented
differed [[244]74]. As the features identified using the biological
meaning approach were related to the mitochondria, heart, kidney, and
liver, the interpretation of the machine learning results implied that
the effects of SARS-CoV-2 on ACE2 and mitochondria were associated with
further impairment of the heart, liver, and kidneys. One interpretation
for the machine learning results using the statistical approach to
feature selection was that SARS-CoV-2 triggers an inflammatory cytokine
storm, which in turn impairs cardiac, hepatic, and renal function.
BAD, the only gene in common between the mitochondria-related toxicity
list and the heart-, kidney-, and liver-related toxicity lists, is a
regulator of apoptosis, and BAD mRNA expression is found in many
tissues (heart, liver, spleen, lung, kidney, hypothalamus, pituitary,
uterus, and ovary) [[245]75]. Apoptosis occurs via the extrinsic death
receptor pathway or the intrinsic intracellular pathway, which
ultimately leads to mitochondrial dysfunction. In hepatocytes, the
convergence of these cell death pathways also requires mitochondrial
damages for effective apoptosis. The mitochondrial pathways of cell
death are regulated by interactions among the Bcl-2 protein family
members [[246]76]. Liver biopsies from SARS patients have shown that
SARS-CoV may induce apoptosis of hepatocytes, leading to liver damage
[[247]77]. Apoptosis leads to the deterioration of cardiac
contractility observed in COVID-19 patients [[248]78]. There is also
evidence showing that COVID-19 causes renal tubular damage due to
mitochondrial damage and apoptosis [[249]79]. Therefore, mitochondrial
dysfunction may be an important factor in apoptotic cell death that
causes cardiac, kidney, and liver damage, and COVID-19 is an important
cause of mitochondrial dysfunction.
Analysis of these common genes also showed that SARS-CoV-2 can hijack
pathways, such as that of oxidative stress, after mitochondrial
infection. Mitochondria are the main source of free radicals that are
responsible for oxidative stress. If the antioxidant system fails to
neutralize these free radicals in a timely manner, oxidative stress
occurs, resulting in damage to cells and tissues. Mitochondria and
mitochondrial DNA are targets of oxidative stress, as both the membrane
structure and the inner components of mitochondria are susceptible to
oxidative damage. When oxidative stress damages mitochondria, it
affects cellular energy production and metabolism, which in turn
affects all of the biological functions of cells and tissues. This
stress causes apoptosis, which in turn is associated with impairment of
cardiac, liver, and kidney function.
Our analysis revealed a correlation between the mitochondrial effects
of COVID-19 and further impairment of cardiac, hepatic, and renal
function. Mitochondrial dysfunction was determined to be a key factor
in COVID-19 [[250]80]. In addition, the analysis of gene expression in
A549 and Calu3 cell lines infected with SARS-CoV-2 revealed an increase
in the expression of genes related to cytokine production, inflammatory
responses, mitochondria and autophagic processes [[251]81].
Mitochondria cause cardiac dysfunction and myocyte damage via loss of
metabolic capacity as well as via production and release of viral
factors [[252]82]. As our study focused on the analysis of cardiac,
renal, and hepatic damage caused by the infection of mitochondria by
SARS-CoV-2, we then used machine learning to validate whether the
significantly expressed genes associated with mitochondria could be
used to predict COVID-19 and to further analyze the damage to the
heart, kidney, and liver caused by mitochondrial impairment.
Other studies have also shown that COVID-19 induces systemic host
responses and transcriptomic changes and that the resulting disruptions
affect the biological processes and functions of each organ system
[[253]83]. In addition, studies have shown that recovered patients show
symptoms of long COVID with multiorgan damage [[254]84–[255]86].
Therefore, further investigation into whether the mitochondrial effects
of COVID-19 also cause subsequent development of long COVID symptoms in
patients is necessary [[256]87].
Conclusions
An increasing number of case studies have recorded acute cardiac
manifestations in patients with COVID-19. A significant proportion of
patients diagnosed with COVID-19, with or without prior cardiovascular
disease, demonstrate high levels of troponin or creatine kinase,
suggesting myocardial injury, which in turn leads to cardiac
insufficiency and arrhythmia [[257]88]. The association between
SARS-CoV-2 infection, mitochondrial dysfunction, and subsequent
cardiovascular disease has been shown to be critical [[258]89]. Similar
findings have also been observed in other organs, such as the liver
[[259]90] and kidneys [[260]91]. In this study, we demonstrated that
SARS-CoV-2 can affect mitochondria, directly invade cells in various
organs via ACE2, and trigger a cytokine storm, which in turn impairs
cardiac, hepatic, and renal function ([261]Fig 7). It is also possible
that these correlations are because these pathways are related to each
other. For example, direct infection by SARS-CoV-2 via ACE2-dependent
pathways correlates with mitochondrial dysfunction [[262]92], and there
is a close association between mitochondrial dysfunction and
immunosenescence; this may lead to an increased possibility of
imbalance in the immune response to SARS-CoV-2 and may manifest as an
exaggerated proinflammatory response and a cytokine storm [[263]93],
resulting in further multiorgan damage.
Fig 7. Mechanisms of COVID-19 causing multiorgan damage.
[264]Fig 7
[265]Open in a new tab
SARS-CoV-2 can affect mitochondria, directly invade cells in various
organs via ACE2, and lead to cytokine storms, oxidative stress,
mitochondrial dysfunction and cell death, which can in turn impair
cardiac, liver, and kidney function. The figure was created with
BioRender.com.
This study identified significantly expressed genes in the
mitochondria-, heart-, kidney-, and renal-related toxicity lists from
Tox analysis for machine learning to validate their association with
COVID-19 and conclude that the mitochondrial infection caused by
COVID-19 further impairs cardiac, hepatic, and renal function. We also
obtained evidence for the correlation between genes, terms, and
pathways in the mitochondria, heart, liver, and kidneys during COVID-19
that have been demonstrated in other studies. The aim of this study was
to obtain the same conclusion using different methods that extended the
inquiry and provided further interpretation for these findings.
Although there are many possible mechanisms by which SARS-CoV-2 causes
multiorgan damage, including direct cellular invasion via ACE2 and
cytokine storms, which in turn impair cardiac, hepatic, and renal
function, the hypothesis of this study, the correlation between the
mitochondrial impact of COVID-19 and further cardiac, renal, and
hepatic impairment, which was tested using machine learning, holds
true.
By carrying out preliminary analysis through transcriptomics analysis,
using machine learning to validate the conclusions of the analysis
results, and cross-comparing reports, this analysis process identified
and validated the hypotheses. However, importantly, the characteristics
of the sample data used, such as the different tissues sampled and the
different detection platforms and sample sizes, may affect the
validation results. The trained machine learning models from
[266]GSE152075 with the top 40 significant genes were tested by other
sample sets, including [267]GSE163151, [268]GSE157103, and GSE1526414.
The result of prediction accuracy is notably poor, which implies that
if the tissues, gene expression detection platforms, or parameter
settings are different, the trained machine learning model cannot be
utilized in other sample sets (S4 Table in [269]S1 File). From this
study, it can be seen that the results of toxicity lists varied between
the different tissues sampled, and the results from the same tissues
could also vary depending on the sampling platform and the ratio or
size of the experimental and control samples. Further interpretation
and the use of machine learning to analyze the effects of different
sample tissues and different sample sizes on the hypothesis may be a
valuable field of interest for subsequent study.
Supporting information
S1 File
(PDF)
[270]Click here for additional data file.^ (598.3KB, pdf)
Data Availability
All RNA-seq files are downloaded from the NCBI Gene Expression Omnibus
(GEO) database (accession numbers GSE152075, and GSE163151, GSE157103,
GSE169241, GSE152641).
Funding Statement
This work was supported by the Center for Advanced Computing and
Imaging in Biomedicine (NTU-112L900701) from The Featured Areas
Research Center Program within the framework of the Higher Education
Sprout Project by the Ministry of Education (MOE) in Taiwan.
References