Abstract The effects of coronavirus disease 2019 (COVID-19) primarily concern the respiratory tract and lungs; however, studies have shown that all organs are susceptible to infection by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). COVID-19 may involve multiorgan damage from direct viral invasion through angiotensin-converting enzyme 2 (ACE2), through inflammatory cytokine storms, or through other secondary pathways. This study involved the analysis of publicly accessible transcriptome data from the Gene Expression Omnibus (GEO) database for identifying significant differentially expressed genes related to COVID-19 and an investigation relating to the pathways associated with mitochondrial, cardiac, hepatic, and renal toxicity in COVID-19. Significant differentially expressed genes were identified and ranked by statistical approaches, and the genes derived by biological meaning were ranked by feature importance; both were utilized as machine learning features for verification. Sample set selection for machine learning was based on the performance, sample size, imbalanced data state, and overfitting assessment. Machine learning served as a verification tool by facilitating the testing of biological hypotheses by incorporating gene list adjustment. A subsequent in-depth study for gene and pathway network analysis was conducted to explore whether COVID-19 is associated with cardiac, hepatic, and renal impairments via mitochondrial infection. The analysis showed that potential cardiac, hepatic, and renal impairments in COVID-19 are associated with ACE2, inflammatory cytokine storms, and mitochondrial pathways, suggesting potential medical interventions for COVID-19-induced multiorgan damage. Introduction Patients with coronavirus disease 2019 (COVID-19) experience various respiratory issues. Acute respiratory distress syndrome (ARDS) due to COVID-19 pneumonia is the primary cause of mortality and long-term lung damage. Although the respiratory system is most commonly affected in people infected with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus can impact any organ in the body [[26]1–[27]3]. Indeed, multiple organs are typically involved in critically ill patients [[28]4]. In addition to classical symptoms of respiratory distress, many patients with COVID-19 have systemic symptoms, including cardiovascular, hepatic or renal failure as well as coagulation disorders. Studies have reported organ damage involving the lungs (33% of patients), heart (32%), kidneys (12%), liver (10%), pancreas (17%), and spleen (6%); 66% of study participants had single- or multiorgan system damage, and 25% of patients showed multiorgan damage with varying degrees of overlap between various organs [[29]5]. Cardiac complications occur in 20–44% of inpatients and are an independent risk factor for COVID-19-related death [[30]6]. Viral invasion of cardiomyocytes [[31]7] or systemic inflammatory responses without direct viral infiltration [[32]8] can cause myocarditis, which can lead to heart failure and arrhythmias. Some patients with severe COVID-19, including those who did not have underlying kidney problems prior to the disease, acquire signs of kidney damage, with more than 30% of COVID-19 inpatients developing kidney damage [[33]9]. In addition, multiple studies have reported the occurrence of liver damage in COVID-19 patients, indicating that 2–11% of COVID-19 patients develop liver comorbidities. Furthermore, in 16–53% of reported cases, increases in alanine aminotransferase (ALT) and aspartate aminotransferase (AST) levels occur during disease progression [[34]10], which suggests that hepatocytes are damaged and that the liver is inflamed. An inflammatory cytokine storm is the most frequently reported phenomenon in COVID-19. Inflammatory cytokines are immune responses intended to kill pathogens; however, the hyperinflammatory state associated with excessive production of cytokines can cause permanent damage to cells and mitochondria and induce cell death, potentially leading to further organ damage [[35]11]. Angiotensin-converting enzyme 2 (ACE2), a key enzyme of the renin-angiotensin-aldosterone system (RAAS) that maintains homeostasis of blood pressure, electrolytes, and the inflammatory response, is also a possible cause of COVID-19-related damage to the lung [[36]12], heart [[37]13], liver [[38]14] and kidney [[39]15] as SARS-CoV-2 enters cells through ACE2. Mitochondria are another important target of SARS-CoV-2. Mitochondria, the main production sites of adenosine triphosphate (ATP) [[40]16], are involved in the regulation of cellular immunity, homeostasis, and cell survival and death. There is evidence suggesting that SARS-CoV-2 hijacks the mitochondria of immune cells, replicates within the mitochondrial structure, and impairs mitochondrial dynamics, leading to cell death [[41]17]. However, whether SARS-CoV-2 can impair organ function by direct viral infection via ACE2, mitochondrial damage, or multiorgan damage triggered by an inflammatory cytokine storm needs to be further investigated. SARS-CoV-2 causes an increase in mitochondrial DNA (mtDNA) levels during infection that may trigger an excessive immune response and lead to severe pathology in COVID-19, including multiorgan failure [[42]18]. While it is believed that mitochondrial antiviral signaling (MAVS) interacts with different SARS-CoV-2 proteins, the SARS-CoV-2 M protein inhibits MAVS protein aggregation, and the mitochondrial membrane-anchored MAVS protein is a key factor in the cellular antiviral defense system that further inhibits the innate antiviral response [[43]19]. Immune evasion and hyperinflammation during COVID-19 can also be related to the disruption of mitochondrial quality [[44]20]. In addition, patients with COVID-19 have reduced mitochondrial oxidative phosphorylation (OXPHOS) and bioenergetics, and COVID-19 is reportedly associated with inhibition of mitochondrial gene transcription [[45]21]. SARS-CoV-2 infection hinders mitochondrial bioenergetics, which in turn can trigger inflammasome activation. Consequently, mitochondrial inhibition not only results in excessive cytokine production but also exerts a substantial impact on organs that heavily depend on mitochondrial energy production [[46]21]. Based on this evidence, we hypothesized that COVID-19 may cause damage to the heart, kidney, and liver via mitochondrial dysfunction and downstream responses in addition to direct viral infection and cytokine storms. Systemic transcriptomic analysis was conducted to examine the roles of mitochondria in COVID-19-related multiorgan damage ([47]Fig 1). Our analysis shows that it can be reasonably inferred that there is a correlation between the mitochondrial damage caused by SARS-CoV-2 infection and further deterioration of heart, kidney, and liver function, resulting in multiorgan damage. This study enables understanding of the causes of multiorgan complications caused by SARS-CoV-2 and the development of treatment regimens. Fig 1. Summary flow chart. [48]Fig 1 [49]Open in a new tab The analysis in this study starts with data collection on gene expression, followed by gene transcriptome analysis to obtain significantly expressed genes and toxicity analysis to further identify relevant significantly expressed genes. Subsequently, machine learning was used with the significantly expressed genes identified through statistical and biological methods to validate the hypothesis that SARS-CoV-2 would cause further damage to cardiac, hepatic, and renal function by infecting mitochondria. Finally, we conducted a literature review on the common genes, feature importance, and pathway analysis of these transcriptomes to investigate how SARS-CoV-2 infects mitochondria and damages cardiac, liver, and kidney function and drew conclusions. Created with BioRender.com. Materials and methods Bioinformatics and machine learning tools were used to analyze multiple publicly available RNA-Seq sample sets from clinical samples. NetworkAnalyst is a comprehensive gene expression profiling and web visualization analysis [[50]22]; Ingenuity Pathway Analysis (IPA) provides analysis and development tools for genomics, proteomics, drug toxicology, and metabolic and regulatory pathway studies [[51]23]; DAVID (Database for Annotation, Visualization and Integrated Discovery) is a web-based tool for functional evaluation of the gene expression data [[52]24–[53]26]; ClueGO [[54]27] is a Cytoscape plug-in for deciphering functionally grouped gene ontology and pathway annotation networks [[55]28]; GSEA (Gene Set Enrichment Analysis) is applied to assess the distribution trend of genes in a specific gene set arranged in a gene table based on their correlation to the phenotype to determine its contribution to the phenotype [[56]29, [57]30]. Python packages for machine learning sklearn (1.2.2), imblearn (0.10.1), and XGBoost (1.7.5) were run in Python 3.10.11. The cnetplot is a function from the clusterProfiler R package, commonly used for visualizing gene set enrichment analysis results. This function creates a category net plot that combines a category enrichment plot with a gene network plot. The version of clusterProfiler is 4.8.3 and runs in R 4.3.1; Venn diagrams and feature importance analysis tools were used to analyze specific genes to identify biological pathways and important molecules in the heart, kidney, and liver that correlate with mitochondria affected by SARS-CoV-2. Data availability The raw counts of the RNA-seq data were obtained from the NCBI Gene Expression Omnibus (GEO) database. The selection of sample sets is determined by several factors, including the sample size for machine learning, the kind of human tissue being studied, and the presence of sufficient differentially expressed genes to facilitate further analysis. Prior to selecting these five sample sets, we conducted tests on additional sample sets. Nevertheless, the majority of these datasets exhibit a limited number of differentially expressed genes, typically fewer than 100 or slightly exceeding this threshold. Hence, after the evaluation based on the criteria, the selected sample sets comprised [58]GSE152075, [59]GSE163151, [60]GSE157103, [61]GSE169241, and [62]GSE152641 for significant differentially expressed genes and toxicity analysis. For the selection of sample sets of machine learning, our criteria were machine learning performance, sample size, imbalanced data state, and overfitting assessment. We also performed a comparison of [63]GSE152075, [64]GSE163151, [65]GSE157103, and [66]GSE152641 for machine learning performance and a complete analysis. The sample sets [67]GSE152075 and [68]GSE163151 were obtained from nasopharyngeal swabs, [69]GSE157103 from leukocytes, [70]GSE169241 from human heart autopsy tissues, and [71]GSE152641 from whole blood ([72]Table 1). These five sample sets were first tested for differences between tissues. We then performed machine learning on [73]GSE152075 with a larger sample size to validate the effect of COVID-19 on mitochondria and to identify further effects on the heart, kidney, and liver. Table 1. Summary of RNA sequence sample sets of COVID-19 in Gene Expression Omnibus. GEO accession GSE title Tissue Platform Sample size [74]GSE152075 In vivo antiviral host transcriptional response to SARS-CoV-2 by viral load, sex, and age [[75]31] Nasopharyngeal swab [76]GPL18573 Illumina NextSeq 500 (Homo sapiens) 484 (COVID: control = 430: 54) [77]GSE152075 Summary 1. SARS-CoV-2 triggered an antiviral response driven by interferon and concurrently decreased the transcription of ribosomal proteins. Levels of B cells and neutrophils were higher in patients with a lower viral load. 2. A decrease in the expression of chemokines CXCL9/10/11, cognate receptors CXCR3, CD8A, and granzyme B was observed in elderly individuals. 3. B- and NK-cell-specific transcripts were reduced and NF-κB inhibitors increased in males. [78]GSE163151 A diagnostic host response biosignature for COVID-19 from RNA profiling of nasal swabs and blood [[79]32] Nasopharyngeal swab [80]GPL24676 Illumina NovaSeq 6000 (Homo sapiens) 149 (COVID: control = 138: 11) [81]GSE163151 Summary 1. A two-step classifier based on machine learning that was run on an individual test set of NP swab samples was able to segregate COVID-19 and non-COVID-19 and infectious or noninfectious acute respiratory disease with 85.1%-86.5% accuracy. 2. SARS-CoV-2 infection has unique biologic features and differences between NP swabs and whole blood that can be used in the differential diagnosis of COVID-19. [82]GSE169241 hPSC-derived cells to model macrophage-mediated inflammation in COVID-19 Hearts [[83]33] Human heart autopsy tissues [84]GPL24676 Illumina NovaSeq 6000 (Homo sapiens) 8 (COVID: control = 3:5) [85]GSE169241 Summary 1. COVID-19 patients showed a significant increase in CCL2 expression and macrophage infiltration in cardiac tissue. 2. After SARS-CoV-2 exposure, macrophages induced increased cardiomyocyte reactive oxygen species and apoptosis via secretion of IL-6 and TNF-α. [86]GSE157103 Large-scale multiomic analysis of COVID-19 severity [[87]34] Leukocyte [88]GPL24676 Illumina NovaSeq 6000 (Homo sapiens) 126 (COVID: control = 100: 26; ICU: non-ICU = 50: 50) [89]GSE157103 Summary 1. Coagulation-related proteins, as well as cellular fibronectin (cFN), were significantly increased in COVID-19 patients. 2. The abundance of prothrombin, which is cleaved to form thrombin during coagulation, was significantly reduced and correlated with severity, which may help to elucidate the hypercoagulable environment of SARS-CoV-2 infection. [90]GSE152641 Transcriptomic similarities and differences in host response between SARS-CoV-2 and other viral infection [[91]35] Whole blood [92]GPL24676 Illumina NovaSeq 6000 (Homo sapiens) 86 (COVID: control = 62: 24) [93]GSE152641 Summary 1. Changes in gene expression in the peripheral blood of COVID-19 patients correlated highly with changes in response to other viral infections. However, two genes, ACO1 and ATL3, showed significantly different changes in expression. 2. Some dynamic immune evasion and counter host responses are specific to COVID-19. For example, CD56bright NK cells, M2 macrophages, and total NK cells were increased in COVID-19. [94]Open in a new tab Processing of RNA sequencing data NetworkAnalyst is utilized for gene expression analysis, and this study required tuning of various parameters. To initiate gene expression analysis using NetworkAnalyst, the mean is employed for gene-level summarization. It simplifies and standardizes representation, reduces random noise in the data, and makes it more amenable to downstream analyses. Additionally, it helps mitigate issues with multiple testing and operates under the assumption that each transcript contributes equally to the gene’s activity. Before performing differential expression analysis, filtering is employed to enhance statistical power by eliminating genes that do not exhibit a response. To obtain accurate and meaningful inferences from differential expression analysis data, it is imperative to employ appropriate normalization techniques. Filtering aids in the elimination of information that is demonstrably inaccurate or unlikely to be instructive. To modify the quantity of genes excluded from subsequent analysis, it is necessary to set up the parameters for variance and abundance filters. The variance filter eliminates features with a variance percentile rank less than the threshold of those with consistent expression values across circumstances. In this instance, the variance filtering was configured to a threshold of 15, in accordance with the default configuration of NetworkAnalyst. Consequently, data in the lowest 15th percentile of expression will be removed. The parameter for eliminating features with counts below the set threshold is referred to as low abundance. In this study, the default value of 4 was employed in NetworkAnalyst. To accurately detect transcriptional differences and to guarantee that the expression distributions of each sample are consistent throughout the whole experiment, normalization is essential. Log2-counts per million (Log2-CPM) is a normalization method commonly used in RNA-seq data processing. Log transformation helps to compress the range of data, ensuring that genes with high expression variability do not disproportionately influence subsequent analyses. The limma approach is commonly utilized in the context of differential expression analysis because of its use of linear models, which frequently leads to improved computational efficiency when compared to alternative methods for differential expression analysis [[95]36]. The adjusted p value was set to 0.05 [[96]37] and the log[2]-fold change to 1.5 [[97]38] to ensure that the differentially expressed genes being identified were statistically significant and biologically meaningful with sufficient variance. This makes the results more biologically relevant and valuable for application. The volcano plots of the significantly expressed genes in [98]GSE152075, [99]GSE169241, [100]GSE157103, [101]GSE163151, and [102]GSE152641 show genes with increased and decreased gene expression (S1 Fig in [103]S1 File) [[104]39]. Gene set enrichment and pathway analysis After obtaining the significantly expressed genes from NetworkAnalyst’s differential gene expression analysis, we used Ingenuity Pathway Analysis (IPA; version 84978992) for a comparative analysis of the five sample sets and obtained toxicity lists for each sample set from the Tox analysis. DAVID, ClueGO, the cnetplot from the clusterProfiler package in R and GSEA were used for pathway enrichment analysis and functional annotation. The significantly expressed genes of these transcriptomes were analyzed in terms of KEGG, REACTOME, and WIKIPATHWAYS in DAVID. Genes are linked with enriched pathways with GO, KEGG, and REACTOME by cnetplot. GSEA assesses how predefined gene sets are distributed within a gene table sorted by their correlation with the phenotype, helping to determine their impact on the phenotype. The gene set database utilized in the analysis was h.all.v2023.1.Hs.symbols.gmt. A total of 1000 permutations were performed, with gene symbols being collapsed in the database. The permutation type employed was phenotypic, and the chip platform used was Human_Gene_Symbol_with_Remapping_MSigDB.v2023.1.Hs.chip. ClueGO was used for comprehensive pathway and biological analysis and functional annotation of GO terms and pathways. The databases of GO, KEGG, WIKIPATHWAYS, REACTOME Reactions, and REACTOME Pathways were included in the ClueGO setting panel. The parameters encompass a p value threshold established at 0.05, a GO hierarchy ranging from level 8 to 15, and a pathway selection criterion of 2 genes with 6% per pathway. Other parameters were set to default. In addition, a Venn diagram was used to identify common genes for analysis of the biological meaning in the mitochondria, heart, kidney, and liver through their transcriptomes. Machine learning The [105]GSE152075 sample set was selected for machine learning based on several considerations. In numerous machine learning tasks, it is generally observed that the performance of the model tends to improve as the number of samples increases. This can be attributed to the fact that the model is subjected to a larger volume of data, enabling it to acquire a broader range of features and patterns. A greater number of data typically leads to improved generalization capabilities of the model. When performing model training with a limited amount of data, there is an increased likelihood of encountering the phenomenon known as overfitting [[106]40]. Overfitting refers to a situation in which the model demonstrates excellent performance when evaluated on the training data but fails to generalize effectively when presented with fresh, previously unseen data. This implies that the model could exhibit excessive complexity and has inflexibly acquired the characteristics of the training data, resulting in poor performance when applied to novel data. The utilization of diverse sample sets may yield disparate outcomes, introducing complexity to the analysis. The [107]GSE169241 sample set derives from human heart autopsy tissues, [108]GSE157103 from leukocytes, and [109]GSE152641 from whole blood samples. Similar to [110]GSE152075, [111]GSE163151 utilized nasopharyngeal swabs as the source of samples. However, due to the severe imbalance of samples between the COVID and control groups (138:11) in [112]GSE163151, even employing techniques such as SMOTE to address the issue of imbalanced data might not produce adequate outcomes. We use several indexes, including F1-Score, MCC, and AUC, to evaluate the performance of models, each with its specific formula, significance, and application, particularly in scenarios with imbalanced data. The metric of F1-Score is the harmonic mean of precision and recall, calculated as [MATH: F1-Score=2×Precision×RecallPrecision+Recall :MATH] A higher F1-Score, closer to 1, indicates better model performance, balancing the precision and recall, which is especially crucial in imbalanced datasets. An F1-Score near 0 indicates poor model performance. MCC (Matthews Correlation Coefficient) is calculated using the formula: [MATH: MCC=TP×TNFP×FNTP+FPTP+FNTN+FPTN+F< mi>N :MATH] Where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. MCC values range from -1 to +1. MCC is particularly valuable in imbalanced datasets as it provides a balanced measure even when class distribution is skewed. AUC (Area Under the Curve) refers to the area under the ROC (Receiver Operating Characteristic) curve, AUC ranges from 0 to 1. AUC is less sensitive to class imbalance, making it a robust measure for evaluating models on imbalanced datasets. In imbalanced data scenarios, these metrics are crucial as they offer more comprehensive insights into a model’s performance than mere accuracy. While a high accuracy might be misleading in such cases, a high score in F1, MCC, and AUC indicates that the model effectively handles both minority and majority classes, providing a holistic assessment of its predictive capabilities across different class distributions. The machine learning comparison was conducted on four sample sets, [113]GSE152075, [114]GSE163151, [115]GSE157103, and GSE1526414, with the top 40 significantly expressed genes (S1 Table in [116]S1 File), which is explained for the number of gene selection in the subsection of sensitivity analysis of machine learning with varying significant gene counts. The [117]GSE169241 sample set was excluded from the comparison due to its small sample size. Based on the factors mentioned above, the overall performance of [118]GSE152075 generally showed the most optimized selection regarding machine learning performance, sample size, imbalanced data state, and overfitting assessment (S2 Table in [119]S1 File). In addition, utilizing several sample sets can offer a more comprehensive outlook; however, performing an in-depth examination of a singular sample set facilitates a more intricate study and comprehension. Therefore, we mainly utilized the [120]GSE152075 sample set, which consists of 484 samples driven by the objective of ensuring consistency and uniformity in the subsequent study. With the selected genes from the [121]GSE152075 sample set as features, machine learning methods were used to test whether the association between COVID-19 and effects in the mitochondria, heart, kidney, and liver could be predicted. Through these machine learning results, the pathways and biological meanings of the genes were further analyzed. We employed four machine learning algorithms, including XGBoost [[122]41] with parameters for colsample_bytree = 0.9, learning_rate = 0.1, max_depth = 10, n_estimators = 50, random forest [[123]42] with parameters for max_depth = 10, min_samples_split = 5, n_estimators = 100, logistic regression [[124]43] with parameters for C = 50, max_iter = 5000 and SVM [[125]44] with parameters for kernel = ’rbf’, C = 100, gamma = 0.01, probability = True, to test and predict COVID-19. The K-fold cross-validation technique was utilized, with K-fold sets to 10. Since the ratio of the experimental group to the control group in the [126]GSE152075 sample set was 430:54, the control group size was insufficient, which created a problem of imbalanced data. Thus, we employed SMOTE (Synthetic Minority Oversampling Technique) to analyze the samples in the minority category and add new samples to the sample set. SMOTE is a method used to address imbalanced sample sets, especially for oversampling minority classes. In imbalanced sample sets, the number of samples from one class greatly outnumbers the other. This results in many machine learning models being biased toward the majority class as they try to maximize overall accuracy, potentially neglecting or misclassifying the minority class. SMOTE addresses the issue of class imbalance not by simply duplicating samples from the minority class but by generating new synthetic samples. For every sample in the minority class, it identifies k-nearest neighbors, all of which belong to the same minority class. A neighbor is then randomly chosen, and a synthetic sample is produced at a random point between this sample and its selected neighbor. This procedure is iteratively carried out until a desired sample count or proportion for the minority class is reached. The synthesis method of SMOTE can be described using the following mathematical formula: Given a sample x[i] from the minority class, a random selection is made from its k nearest neighbors, denoted as x[zi]. Next, a random number λ between 0 and 1 is chosen. The new synthetic sample x[new] can be generated using the formula: [MATH: xnew=< msub>xi+λ×(xzixi) :MATH] This method ensures that the new synthetic sample lies somewhere on the line segment between the original sample and its chosen neighbor. Each iteration might yield different results because λ is randomized. SMOTE offers notable advantages in tackling data imbalance. Instead of duplicating minority class samples, it produces synthetic samples, enhancing the sample set’s diversity and reducing the risk of overfitting. Furthermore, by expanding the minority class data, SMOTE ensures that models can better grasp the nuances of this class, leading to improved prediction accuracy [[127]45]. Using this method, the ratio of the experimental group to the control group was 1:1. SHAP (SHapley Additive exPlanations, 0.41.0) was used to analyze the prediction interpretation of the contribution of each feature. We then calculated the Shapley value of each feature to measure the contribution of the feature to the prediction so that the contribution of each feature could be understood in detail [[128]46]. The codes are available at [129]https://github.com/ntumitolab/ML-RNA-Seq. Results After analyzing and processing the RNA sequencing raw count data for sample sets [130]GSE152075, [131]GSE163151, [132]GSE169241, [133]GSE157103 and [134]GSE152641, the significant genes of each sample set were obtained and tested in machine learning, pathway analysis and IPA-Tox analysis. Sensitivity analysis of machine learning with varying significant gene counts The initial investigation involved assessing the predictive capabilities of the features identified by the statistical methodology for COVID-19 to gain insights into the potential utility of machine learning. Consequently, we ranked the significantly expressed genes from the differential gene expression analysis of [135]GSE152075 from NetworkAnalyst by their adjusted p value. The top 100 significantly expressed genes from the differential expression analysis of [136]GSE152075 (S3 Table in [137]S1 File) were ordered by the adjusted p value and used for machine learning and pathway analysis. The machine learning prediction power for predicting COVID-19 was tested on the top 100, 80, 60, 50, 40, 30, and 20 significantly expressed genes, and the results showed that the accuracy, F1-Score, and AUC of the four machine learning algorithms except SVM were near or above 90% for the top 30–100 significantly expressed genes ([138]Table 2). This implies 30 significantly expressed genes to be sufficient for good prediction power. However, the overall performances of 40 significantly expressed genes in these four machine learning algorithms were better than 30 significantly expressed genes. Therefore, we proceeded with 40 genes for a more comprehensive pathway enrichment exploration. Table 2. Machine learning results of the top 100, 80, 60, 50, 40, 30, 20 significantly expressed genes, and randomly selected 40 genes in [139]GSE152075. Features 100 genes 80 genes 60 genes 50 genes 40 genes 30 genes 20 genes Randomly selected 40 genes Machine Learning XGBoost Accuracy 0.963 0.963 0.965 0.963 0.969 0.961 0.950 0.851 Sensitivity 0.979 0.984 0.984 0.979 0.984 0.977 0.967 0.921 Specificity 0.833 0.796 0.815 0.833 0.852 0.833 0.815 0.296 Precision 0.979 0.975 0.977 0.979 0.981 0.979 0.977 0.912 F1-Score 0.979 0.979 0.980 0.979 0.983 0.978 0.972 0.917 MCC 0.812 0.807 0.819 0.812 0.842 0.804 0.758 0.225 AUC 0.973 0.972 0.978 0.974 0.972 0.970 0.943 0.657 Random Forest Accuracy 0.965 0.969 0.969 0.965 0.969 0.965 0.952 0.849 Sensitivity 0.988 0.991 0.991 0.988 0.991 0.988 0.984 0.919 Specificity 0.778 0.796 0.796 0.778 0.796 0.778 0.704 0.296 Precision 0.973 0.975 0.975 0.973 0.975 0.973 0.964 0.912 F1-Score 0.980 0.983 0.983 0.980 0.983 0.980 0.974 0.915 MCC 0.815 0.837 0.837 0.815 0.837 0.815 0.745 0.220 AUC 0.975 0.970 0.963 0.964 0.961 0.964 0.956 0.646 Logistic Regression Accuracy 0.938 0.924 0.913 0.909 0.899 0.897 0.798 0.698 Sensitivity 0.944 0.935 0.916 0.909 0.902 0.898 0.781 0.726 Specificity 0.889 0.833 0.889 0.907 0.870 0.889 0.926 0.481 Precision 0.985 0.978 0.985 0.987 0.982 0.985 0.988 0.918 F1-Score 0.964 0.956 0.949 0.947 0.941 0.939 0.873 0.810 MCC 0.737 0.676 0.669 0.667 0.628 0.631 0.487 0.143 AUC 0.976 0.974 0.973 0.974 0.976 0.962 0.943 0.691 SVM Accuracy 0.880 0.851 0.829 0.820 0.800 0.725 0.643 0.446 Sensitivity 0.867 0.847 0.821 0.807 0.786 0.700 0.607 0.426 Specificity 0.981 0.889 0.889 0.926 0.907 0.926 0.926 0.611 Precision 0.997 0.984 0.983 0.989 0.985 0.987 0.985 0.897 F1-Score 0.928 0.910 0.895 0.889 0.875 0.819 0.751 0.577 MCC 0.638 0.546 0.511 0.517 0.480 0.408 0.337 0.023 AUC 0.975 0.969 0.968 0.967 0.962 0.957 0.962 0.654 [140]Open in a new tab A randomly selected 40 genes from all genes of [141]GSE152075 as machine learning features were also tested as a baseline to validate the result of assessing the predictive capabilities of the features identified by the statistical methodology for COVID-19. The result showed that the machine learning predictive capabilities cannot be established without pre-process feature selection. The top 40 significantly expressed genes in the [142]GSE152075 sample set ([143]Fig 2A) were then analyzed in DAVID for KEGG, REACTOME, and WIKIPATHWAYS pathway analysis. COVID-19- and SARS-CoV-2-related pathways were identified as some of the major pathways from these top 40 significantly expressed genes. In addition, pathways related to inflammatory cytokine storms, such as interferon signaling [[144]47], interferon alpha/beta/gamma signaling, and cytokine signaling in the immune system, were identified ([145]Fig 2B), accounting for a large proportion of the pathway analysis results in ClueGo ([146]Fig 2C). Genes linked with enriched pathways by cnetplot shows GO enrichment analysis for the top 40 DEGs in Biological Process (BP) is mainly virus-related pathway ([147]Fig 2D), KEGG is coronavirus disease—COVID-19 ([148]Fig 2E), and REACTOME is interferon-related signaling ([149]Fig 2F). Interferon alpha/gamma signaling can also be seen in GSEA ([150]Fig 2G). Hence, it can be inferred that inflammatory cytokine storms are strongly associated with COVID-19. Using these top 40 significantly expressed genes as machine learning features may effectively predict COVID-19. Fig 2. Pathway analysis of the top 40 significantly expressed genes in [151]GSE152075. [152]Fig 2 [153]Open in a new tab (A) List of the top 40 significantly expressed genes in the [154]GSE152075 sample set. (B) DAVID analysis results of the top 40 significantly expressed genes in the REACTOME pathways. (C) Network of pathways from ClueGO for the top 40 significantly expressed genes showing that SARS-CoV-2 and inflammatory cytokine storm-related interferon signaling pathways are the main terms by group among the top 40 significantly expressed genes. Cnetplot of (D) GO enrichment analysis for the top 40 DEGs in Biological Process, (E) KEGG, and (F) REACTOME. (G) GSEA results for enrichment analysis of the top 40 significantly expressed genes. Tox analysis of mitochondria, heart, kidney, and liver The toxicity lists were obtained by performing Tox analysis on the significantly expressed genes in the five sample sets ([155]Table 1) via IPA. The samples from different tissues and sampling platforms showed different common genes ([156]Fig 3) and toxicity lists, indicating that the pathogenesis of COVID-19 is tissue specific. Fig 3. The common genes across the RNA-seq datasets. [157]Fig 3 [158]Open in a new tab Top 100 significantly expressed genes were compared between RNA-seq datasets [159]GSE152075, [160]GSE163151, [161]GSE169241, [162]GSE157103 and [163]GSE152641. The Venn diagram result indicates that the pathogenesis of COVID-19 is tissue specific. The nasopharyngeal swab data from [164]GSE152075 and [165]GSE163151 each showed the largest number of genes in common between these five sample sets and the toxic effects of COVID-19 on the heart, kidney, liver, and mitochondria ([166]Fig 4A and 4B). The human heart tissue samples from [167]GSE169241 showed the toxic effects of COVID-19 mainly on the heart and mitochondria ([168]Fig 4C). While the data on leukocytes from [169]GSE157103 and on whole-blood samples from [170]GSE152641 showed that COVID-19 had toxic effects on the heart, kidney, and liver, the toxic effects on mitochondria were not significant ([171]Fig 4D and 4E). However, the toxicity lists obtained from the [172]GSE157103 leukocyte samples with ICU and non-ICU toxicity analysis showed a toxic mitochondrial effect in addition to the toxic cardiac, renal, and hepatic effects ([173]Fig 4F), which suggests that mitochondrial dysfunction is associated with disease progression in COVID-19 patients [[174]48]. Fig 4. IPA comparative analysis of the RNA-seq data. [175]Fig 4 [176]Open in a new tab Differentially expressed genes between the COVID-19 patients and the control groups were compared in the comparative IPA analysis for toxicity lists and toxicity functions. The following RNA-seq sample sets were obtained from the NCBI Gene Expression Omnibus database (GEO): [177]GSE152075 (A, nasopharyngeal swab), [178]GSE163151 (B, nasopharyngeal swab), [179]GSE169241 (C, human heart autopsy tissues), [180]GSE157103 (D, leukocyte), and [181]GSE152641 (E, whole blood). ICU patients and non-ICU patients were compared in the [182]GSE157103 sample set (F). Machine learning of genes associated with the mitochondria-, heart-, kidney-, and liver-related toxicity list In IPA-tox analysis of the [183]GSE152075 sample set, we identified the differentially expressed genes (DEGs) associated with the mitochondria-related toxicity list ([184]Table 3) under the criteria of -log(p value) > 1.3 for further machine learning analysis. Mitochondrial dysfunction was the relevant pathway identified by the mitochondria-related toxicity list, which included 32 DEGs in [185]GSE150275. Similarly, significantly expressed genes were identified from the heart-, kidney-, and liver-related toxicity lists ([186]Table 3). Among them, the heart-related toxicity lists included cardiac fibrosis [[187]49] and cardiac necrosis/cell death pathways [[188]50]; there were 38 genes in these toxicity lists. The kidney-related toxicity list included renal necrosis/cell death [[189]51], increases renal nephritis [[190]52] panel (PSTC) [[191]53], and increases glomerular injury pathways [[192]54]; there were 55 genes in these toxicity lists. The liver-related toxicity lists included increases in liver hepatitis [[193]55], liver necrosis/cell death [[194]56] and increases in liver damage pathways [[195]57]; there were 42 genes in these toxicity lists ([196]Table 3). These identified genes from the toxicity lists were compared to the top 100 and top 40 significantly expressed genes, and only a few common significantly expressed genes were identified, including NDUFV1 and PRDX5 in mitochondrial dysfunction and RRAD, CIB1, CYBB, and SLC8A1 in the heart. There were many differences in gene expression between those identified by the statistical meaning method and those identified by the biological meaning method. Table 3. Genes associated with mitochondria-, heart-, kidney-, and liver-related toxicity list and top significantly expressed genes in [197]GSE152075. Genes associated with mitochondrial-related toxicity list (32 genes) Genes associated with heart-related toxicity list (38 genes) Genes associated with renal-related toxicity list (55 genes) Genes associated with liver-related toxicity list (42 genes) Related toxicity list Mitochondrial Dysfunction Cardiac Fibrosis Cardiac Necrosis/Cell Death Renal Necrosis/Cell Death Increases Renal Nephritis Renal Safety Biomarker Panel (PSTC) Increases Glomerular Injury Increases Liver Hepatitis Liver Necrosis/Cell Death Increases Liver Damage Genes in toxicity list ACO2, ATP5F1E, ATP5ME, BAD, CLIC2, COX4I1, COX5A, COX5B, COX6A1, COX7B, CYC1, FIS1, GPX1, GPX4, GSTP1, MT-ND6, NDUFA11, NDUFA13, NDUFA2, NDUFA4, NDUFAB1, NDUFB10, NDUFB2, NDUFB7, NDUFS6, NDUFV1, PRDX5, SURF1, TOMM7, UQCR11, UQCRC1, UQCRQ CYBB, TLR4, PRKCB, SLC8A1, TLR2, GPX1, ACE2, DVL1, SLC2A1, STEAP3, CIB1, NDUFS6, MB, CARD6, JUN, LMNA, KLF15, LARP6, DAG1, USP18, NEXN, FLT1, PIN1, NUB1, CTNNB1, PROX1, S100A6, CCN1, RRAD, CASP1, BAD, NDUFA13, SOCS3, KLF4, NTN1, SERPINF1, JUND, ABCC9 MIF, TLR2, CLU, CTNNB1, FLT1, TLR7, TNFSF13B, TNFSF14, OLR1, LRP5, CYBB, GPX4, ZEB1, TLR4, P2RX7, KMO, ACE2, SLC2A1, FOS, LAMB2, PRKCB, MEFV, APRT, HSPA1A, HSPA1B, NDUFAB1, CASP1, STUB1, BAD, ALDH3B1, MLKL, BCL2L14, RFXANK, IDO1, SLC8A1, PRDX2, IER3, GNB2, BIRC3, CITED2, NTN1, GSTP1, ERBB2, ZBP1, APOBEC3A, AIM2, FCGR3A, FCGR3B, CX3CR1, TFF3, DDR1, CD274, JUN, C3AR1, CCR1 TLR4, GBP5, CASP1, CCL4, TLR2, TLR7, JUN, TNFSF14, ALDH3A1, P2RX7, KMO, MIF, BSG, CCR5, IL2RG, ATG4B, AIM2, CCN1, EPHA2, CCL2, KEAP1, CHCHD2, PROS1, SELL, KRT8, CTNNB1, FOS, CXCL10, IRF8, CD274, TKT, BAD, FGL2, GADD45B, SIGIRR, SOCS3, IER3, BIRC3, USP18, PTPRC, JUND, PHB2 Common genes with top 100 significantly expressed genes NDUFV1, PRDX5 RRAD, CIB1, CYBB, SLC8A1 TNFSF13B, CCR1, CYBB, SLC8A1 ALDH3A1, CXCL10, GBP5 Common genes with top 40 significantly expressed genes PRDX5 RRAD _ CXCL10 [198]Open in a new tab Next, these DEGs in the mitochondria-, heart-, renal-, and liver-related toxicity lists were used as features to test the prediction efficiency of COVID-19 in machine learning models. The results showed that the gene sets in the mitochondria-, heart-, renal-, and liver-related toxicity list could mostly provide accuracy, F1-Score, and AUC of the four machine learning algorithms except SVM, which were near or above 90% ([199]Table 4). These machine learning results demonstrated that the toxicity analyses of the mitochondria, heart, liver, and kidney correlated with COVID-19 and were sufficient for machine learning to predict COVID-19. Table 4. Machine learning results of genes associated with the mitochondria-, heart-, kidney-, and liver-related toxicity list in [200]GSE152075. Feature Genes selected from Mitochondrial dysfunction toxicity list (32 genes) Genes selected from Heart toxicity list (Cardiac Fibrosis + Cardiac Necrosis/Cell Death) (38 genes) Genes selected from Renal toxicity list (Renal Necrosis/Cell Death + Increases Renal Nephritis + Renal Safety Biomarker Panel (PSTC) + Increases Glomerular Injury) (55 genes) Genes selected from Liver toxicity list (Increases Liver Hepatitis + Liver Necrosis/Cell Death + Increases Liver Damage) (42 genes) Machine Learning XGBoost Accuracy 0.948 0.938 0.938 0.942 Sensitivity 0.981 0.970 0.974 0.967 Specificity 0.685 0.685 0.648 0.741 Precision 0.961 0.961 0.957 0.967 F1-Score 0.971 0.965 0.965 0.967 MCC 0.723 0.678 0.668 0.708 AUC 0.859 0.913 0.913 0.933 Random Forest Accuracy 0.944 0.952 0.940 0.944 Sensitivity 0.991 0.993 0.986 0.981 Specificity 0.574 0.630 0.574 0.648 Precision 0.949 0.955 0.949 0.957 F1-Score 0.969 0.974 0.967 0.969 MCC 0.687 0.738 0.664 0.697 AUC 0.922 0.955 0.960 0.969 Logistic Regression Accuracy 0.905 0.897 0.907 0.909 Sensitivity 0.930 0.907 0.912 0.914 Specificity 0.704 0.815 0.870 0.870 Precision 0.962 0.975 0.982 0.982 F1-Score 0.946 0.940 0.946 0.947 MCC 0.574 0.600 0.647 0.652 AUC 0.863 0.898 0.919 0.939 SVM Accuracy 0.936 0.899 0.771 0.723 Sensitivity 0.979 0.919 0.756 0.693 Specificity 0.593 0.741 0.889 0.963 Precision 0.95 0.966 0.982 0.993 F1-Score 0.964 0.942 0.854 0.816 MCC 0.646 0.574 0.437 0.425 AUC 0.935 0.949 0.937 0.966 [201]Open in a new tab To determine whether COVID-19 may further impair cardiac, hepatic, and renal function due to mitochondrial dysfunction, we added the 32 significantly expressed genes from the mitochondria-related toxicity lists to the significantly expressed genes of heart-, liver-, and kidney-related toxicity lists. After combining the lists, there were 66 significantly expressed genes related to the mitochondria and heart, 83 significantly expressed genes related to the mitochondria and kidney, and 73 significantly expressed genes related to the mitochondria and liver. The results of this round of machine learning indicated that the accuracy, F1 score, and AUC of all the machine learning algorithms, including SVM, using the three sets of transcriptomes were above 90% ([202]Table 5). We found that adding the significantly expressed genes in the mitochondria-related toxicity list to those of the heart-, liver-, and kidney-related toxicity lists improved the prediction powers in machine learning models. In particular, the accuracy of the machine learning algorithm SVM of significantly expressed genes in the heart-, kidney-, and liver-related toxicity lists all improved notably (S2 Fig in [203]S1 File). Thus, we can infer that COVID-19 may cause further damage to cardiac, liver, and kidney function by damaging mitochondria. Table 5. Machine learning results of genes associated with mitochondria and other organ-related toxicity lists. Features Genes selected from Mitochondrial dysfunction + Heart (Cardiac Fibrosis + Cardiac Necrosis/Cell Death) toxicity list in [204]GSE152075 (66 genes) Genes selected from Mitochondrial dysfunction + Renal (Renal Necrosis/Cell Death + Increases Renal Nephritis + Renal Safety Biomarker Panel (PSTC) + Increases Glomerular Injury) toxicity list in [205]GSE152075 (83 genes) Genes selected from Mitochondrial dysfunction + Liver (Increases Liver Hepatitis + Liver Necrosis/Cell Death + Increases Liver Damage) toxicity list in [206]GSE152075 (73 genes) Machine Learning XGBoost Accuracy 0.955 0.946 0.934 Sensitivity 0.986 0.984 0.965 Specificity 0.704 0.648 0.685 Precision 0.964 0.957 0.961 F1-Score 0.975 0.970 0.963 MCC 0.755 0.707 0.661 AUC 0.905 0.912 0.952 Random Forest Accuracy 0.957 0.957 0.963 Sensitivity 0.991 0.993 0.998 Specificity 0.685 0.667 0.685 Precision 0.962 0.960 0.962 F1-Score 0.976 0.976 0.979 MCC 0.764 0.763 0.799 AUC 0.957 0.964 0.970 Logistic Regression Accuracy 0.919 0.940 0.940 Sensitivity 0.933 0.953 0.951 Specificity 0.815 0.833 0.852 Precision 0.976 0.979 0.981 F1-Score 0.954 0.966 0.966 MCC 0.657 0.727 0.732 AUC 0.906 0.922 0.953 SVM Accuracy 0.917 0.911 0.913 Sensitivity 0.935 0.926 0.926 Specificity 0.778 0.796 0.815 Precision 0.971 0.973 0.975 F1-Score 0.953 0.949 0.950 MCC 0.638 0.628 0.641 AUC 0.952 0.936 0.969 [207]Open in a new tab Common gene analysis of the genes associated with mitochondria-, heart-, kidney-, and liver-related toxicity Finally, we identified the common genes among the significantly expressed genes from the mitochondria-related toxicity list and the three sets of significantly expressed genes from the heart-, kidney-, and liver-related toxicity lists for further analysis. The common genes in the mitochondria- and heart-related toxicity lists were NDUFS6, NDUFA13, GPX1, and BAD; the common genes in the mitochondria- and kidney-related toxicity lists were GPX4, GSTP1, NDUFAB1, and BAD. BAD (BCL2-associated agonist of cell death) was the only common gene between the mitochondria-related toxicity list and heart-, kidney-, and liver-related toxicity lists ([208]Fig 5). The BAD protein, belonging to the Bcl-2 gene family, is a proapoptotic member involved in initiating apoptosis, which might explain the cell death in various tissue types and contribute to further pathogenicity and organ damage [[209]58]. Fig 5. Common genes associated with mitochondria-, heart-, kidney-, and liver-related toxicity lists. [210]Fig 5 [211]Open in a new tab Regarding the common genes between the mitochondria-related toxicity lists and heart- and kidney-related toxicity lists, GPX1 and GPX4 were associated with oxidative stress; NDUFAB1, NDUFS6, and NDUFA13 were associated with OXPHOS; and GSTP1 was associated with the NRF2-mediated oxidative stress response. In addition to mitochondrial dysfunction, oxidative stress and the NRF2-mediated oxidative stress response were included in the [212]GSE152075 toxicity lists. OXPHOS generates reactive oxygen species (ROS) [[213]59]. The presence of excess ROS during the regulation of intracellular signaling may cause irreversible damage to cellular components and trigger apoptosis through the mitochondrial intrinsic apoptotic pathway [[214]60]. Therefore, oxidative stress can cause apoptosis via a mitochondria-dependent pathway [[215]61]. SHAP was used to determine the feature importance of significantly expressed genes in the mitochondria-, heart-, kidney-, and liver-related toxicity lists to analyze which genes have the greatest impact. The top ranking feature importance of the significantly expressed genes CXCL10, ATP5F1E, and ACE2 ([216]Fig 6A) identified using the biological meaning method in the mitochondria-, heart-, kidney-, and liver-related toxicity lists. CXCL10 is associated with cytokine storms [[217]62] and is an important chemokine [[218]63]. Furthermore, it is reported to be an exceptional prognostic biomarker for COVID-19 patients [[219]64, [220]65]. ACE2 is an entry receptor for SARS-CoV-2 and is also associated with mtDNA depletion and mitochondrial dysfunction [[221]66]. ATP5F1E encodes a subunit of mitochondrial ATP synthase. A significant increase in expression of ATP5F1E in COVID-19 patients has been reported [[222]67], which might be related to elevated production of ROS and increased inflammation [[223]68, [224]69]. The network of pathways analysis from ClueGO involved the selection of the top feature importance of significantly expressed genes with SHAP values higher than the average. These genes include CXCL10, RRAD, USP18, ATP5F1E, CIB1, C3AR1, CYBB, PROS1, ACE2, STUB1, UQCRQ, MT-ND6, SLC8A1, NDUFV1, COX5A, FLT1, NDUFA13, NDUFB7, BAD, ATP5ME, NDUFAB1, LAMB2, SOCS3, PHB2, TFF3, and KLF15. The analyzed pathways show that the genes from mitochondria-, heart-, kidney-, and liver-related toxicity lists are closely related to COVID-19 and mitochondria, including the mitochondrial immune response to SARS-CoV-2, COVID-19 adverse outcome pathway, oxidative phosphorylation, type II interferon signaling, regulation of IFNA/IFNB signaling, and electron transport chain:OXPHOS system in mitochondria ([225]Fig 6B). Genes linked with enriched pathways by cnetplot shows GO enrichment analysis for top feature importance of significantly expressed genes with SHAP values higher than the average in Biological Process (BP) is oxidative phosphorylation and ATP-related pathway ([226]Fig 6C), KEGG is also oxidative phosphorylation ([227]Fig 6D), and REACTOME is ATP-related pathway ([228]Fig 6E). Thus, the top-ranking feature importance significantly expressed genes from mitochondria-, heart-, kidney-, and liver-related toxicity lists were identified by a biological meaning approach for machine learning analysis, and the main determinants included factors related to COVID-19, immune response and mitochondria. Several interferon signaling genes, such as IFIT1, IFIT2, IFIT3, IFI5 and CXCL10, which are also associated with inflammatory cytokine storms and immune responses, were among the top 40 significantly expressed genes. Therefore, if the top significantly expressed genes were identified by a statistical meaning approach for machine learning analysis, the main determinants included factors also related to the inflammatory cytokine storm and immune response. Fig 6. Feature importance analysis. [229]Fig 6 [230]Open in a new tab (A) Feature importance is based on SHAP values of the genes from the mitochondria-, heart-, kidney-, and liver-related toxicity lists. (B) ClueGO network analysis of the top features with SHAP values higher than the average. SHAP was also used to identify feature importance genes in XGBoost to further elucidate the heart, kidney, and liver damage caused by mitochondrial infection triggered by COVID-19. Cnetplot of (C) GO enrichment analysis for the top features with SHAP values higher than the average in Biological Process, (D) KEGG, and (E) REACTOME. Discussion Machine learning is often applied for prediction or classification. Here, we employ machine learning in a reverse sense: we first formulated a hypothesis and then used the data that met the hypothesis to perform machine learning. If the accuracy of the prediction was high, the hypothesis had a high probability of being correct in terms of logical inference, which would provide an interpretation for the results obtained through machine learning. The main objective of this study was to investigate the effects of COVID-19 on mitochondrial infection and subsequent damage to the heart, kidneys, and liver. To establish the correlation, two aims were hypothesized and then tested to determine the hypothetical validity. One aim used the statistical meaning approach and generated the prediction for machine learning, which would then be used as a baseline. This approach also allowed us to understand the reliability and rationality of machine learning when used on these sample sets. The other aim used the biological meaning of COVID-19 to analyze and elucidate the main hypothesis of our study, which regarded the infection of mitochondria by COVID-19 and subsequent cardiac, renal, and liver damage. Therefore, this study employed machine learning to evaluate our hypotheses. The first aim used a statistical meaning approach to find relevant genes to be used as machine learning features. The machine learning results showed that by selecting the top significantly expressed genes for machine learning, it was possible to predict COVID-19 effectively, but this approach did not directly show the effect of COVID-19 on mitochondria, which might be because other factors have more direct effects. For example, inflammatory cytokine storms constituted a significant proportion in the DAVID analysis of the top 40 significantly expressed genes. Inflammatory cytokine storms have a strong connection with COVID-19, which causes multiorgan damage [[231]70]. Machine learning can effectively predict COVID-19 using the top 40 significantly expressed genes as features. The second aim tested the machine learning features selected from biological meaning, since the importance of individual genes in biological pathways and biological meaning do not entirely reflect the level of gene expression (i.e., few genes from the selected toxicity lists were shown in the top 100 significantly expressed genes). We used [232]GSE152075 toxicity analysis to identify significantly expressed genes in the mitochondria-, heart-, kidney-, and liver-related toxicity lists. We then used machine learning to validate whether the significantly expressed genes identified by the biological meaning approach are able to predict COVID-19 and whether the infection of mitochondria by COVID-19 might have further biological meaning for potential cardiac, kidney, and liver damage. The results of the final analysis implied a correlation between the impact of COVID-19 on mitochondria and further cardiac, renal, and hepatic impairment. Therefore, we concluded that the effect of COVID-19 on mitochondria is associated with the potential impairment of cardiac, hepatic, and renal functions. Machine learning has been widely applied to the diagnosis of COVID-19 patients. For example, the [233]GSE152075 sample set has been used in other machine learning studies that used automated ML (AutoML) [[234]71] and XGBoost for feature selection [[235]72]. Among the 24 selected feature genes (IGFBP2, KRT8, RPLP0, XAF1, RPL13, OAS2, CES1, RPL4, EEF1G, NR2F6, RPS8, RPL10A, SNX14, C5orf15, TNFRSF19, CD24, ALAS1, CEP112, C9orf24, POLR2J3, AAMP, DUOX2, EMCN, and RPL3), keratin, type II cytoskeletal 8 (KRT8) was the only common gene out of the significantly expressed genes in the mitochondria-, heart-, kidney-, and liver-related toxicity lists obtained from our analysis of [236]GSE152075 gene expression data. Maleknia et al. [[237]73] employed the least absolute shrinkage and selection operator (LASSO) regression model to perform feature selection, and nasopharyngeal swab sample sets from [238]GSE163151, [239]GSE152075, [240]GSE156063, and [241]GSE188678 were applied. Random forest classification was used for training prediction. The common genes shared between those selected via feature selection using this LASSO regression model (COPA, CXCL11, IFI6, MIF, NUCB1, SAMHD1, SIGLEC1, and TMED9) and the top 100 significantly expressed genes of [242]GSE152075 were interferon alpha inducible protein 6 (IFI6), C-X-C motif chemokine ligand 11 (CXCL11), and sialic acid binding Ig-like lectin 1 (SIGLEC1), all of which are related to the immune response. There was one gene in common when screening against all the significantly expressed genes in the [243]GSE152075 mitochondria-, heart-, kidney-, and liver-related toxicity lists, which was macrophage migration inhibitory factor (MIF). Since different feature selection methods yield different genes, the biological pathways or biological meaning they represent can vary. In our study, significantly expressed genes obtained from the statistical meaning approach and from the biological meaning approach had few genes in common. However, both machine learning results were predictive of COVID-19, and only the interpretation they presented differed [[244]74]. As the features identified using the biological meaning approach were related to the mitochondria, heart, kidney, and liver, the interpretation of the machine learning results implied that the effects of SARS-CoV-2 on ACE2 and mitochondria were associated with further impairment of the heart, liver, and kidneys. One interpretation for the machine learning results using the statistical approach to feature selection was that SARS-CoV-2 triggers an inflammatory cytokine storm, which in turn impairs cardiac, hepatic, and renal function. BAD, the only gene in common between the mitochondria-related toxicity list and the heart-, kidney-, and liver-related toxicity lists, is a regulator of apoptosis, and BAD mRNA expression is found in many tissues (heart, liver, spleen, lung, kidney, hypothalamus, pituitary, uterus, and ovary) [[245]75]. Apoptosis occurs via the extrinsic death receptor pathway or the intrinsic intracellular pathway, which ultimately leads to mitochondrial dysfunction. In hepatocytes, the convergence of these cell death pathways also requires mitochondrial damages for effective apoptosis. The mitochondrial pathways of cell death are regulated by interactions among the Bcl-2 protein family members [[246]76]. Liver biopsies from SARS patients have shown that SARS-CoV may induce apoptosis of hepatocytes, leading to liver damage [[247]77]. Apoptosis leads to the deterioration of cardiac contractility observed in COVID-19 patients [[248]78]. There is also evidence showing that COVID-19 causes renal tubular damage due to mitochondrial damage and apoptosis [[249]79]. Therefore, mitochondrial dysfunction may be an important factor in apoptotic cell death that causes cardiac, kidney, and liver damage, and COVID-19 is an important cause of mitochondrial dysfunction. Analysis of these common genes also showed that SARS-CoV-2 can hijack pathways, such as that of oxidative stress, after mitochondrial infection. Mitochondria are the main source of free radicals that are responsible for oxidative stress. If the antioxidant system fails to neutralize these free radicals in a timely manner, oxidative stress occurs, resulting in damage to cells and tissues. Mitochondria and mitochondrial DNA are targets of oxidative stress, as both the membrane structure and the inner components of mitochondria are susceptible to oxidative damage. When oxidative stress damages mitochondria, it affects cellular energy production and metabolism, which in turn affects all of the biological functions of cells and tissues. This stress causes apoptosis, which in turn is associated with impairment of cardiac, liver, and kidney function. Our analysis revealed a correlation between the mitochondrial effects of COVID-19 and further impairment of cardiac, hepatic, and renal function. Mitochondrial dysfunction was determined to be a key factor in COVID-19 [[250]80]. In addition, the analysis of gene expression in A549 and Calu3 cell lines infected with SARS-CoV-2 revealed an increase in the expression of genes related to cytokine production, inflammatory responses, mitochondria and autophagic processes [[251]81]. Mitochondria cause cardiac dysfunction and myocyte damage via loss of metabolic capacity as well as via production and release of viral factors [[252]82]. As our study focused on the analysis of cardiac, renal, and hepatic damage caused by the infection of mitochondria by SARS-CoV-2, we then used machine learning to validate whether the significantly expressed genes associated with mitochondria could be used to predict COVID-19 and to further analyze the damage to the heart, kidney, and liver caused by mitochondrial impairment. Other studies have also shown that COVID-19 induces systemic host responses and transcriptomic changes and that the resulting disruptions affect the biological processes and functions of each organ system [[253]83]. In addition, studies have shown that recovered patients show symptoms of long COVID with multiorgan damage [[254]84–[255]86]. Therefore, further investigation into whether the mitochondrial effects of COVID-19 also cause subsequent development of long COVID symptoms in patients is necessary [[256]87]. Conclusions An increasing number of case studies have recorded acute cardiac manifestations in patients with COVID-19. A significant proportion of patients diagnosed with COVID-19, with or without prior cardiovascular disease, demonstrate high levels of troponin or creatine kinase, suggesting myocardial injury, which in turn leads to cardiac insufficiency and arrhythmia [[257]88]. The association between SARS-CoV-2 infection, mitochondrial dysfunction, and subsequent cardiovascular disease has been shown to be critical [[258]89]. Similar findings have also been observed in other organs, such as the liver [[259]90] and kidneys [[260]91]. In this study, we demonstrated that SARS-CoV-2 can affect mitochondria, directly invade cells in various organs via ACE2, and trigger a cytokine storm, which in turn impairs cardiac, hepatic, and renal function ([261]Fig 7). It is also possible that these correlations are because these pathways are related to each other. For example, direct infection by SARS-CoV-2 via ACE2-dependent pathways correlates with mitochondrial dysfunction [[262]92], and there is a close association between mitochondrial dysfunction and immunosenescence; this may lead to an increased possibility of imbalance in the immune response to SARS-CoV-2 and may manifest as an exaggerated proinflammatory response and a cytokine storm [[263]93], resulting in further multiorgan damage. Fig 7. Mechanisms of COVID-19 causing multiorgan damage. [264]Fig 7 [265]Open in a new tab SARS-CoV-2 can affect mitochondria, directly invade cells in various organs via ACE2, and lead to cytokine storms, oxidative stress, mitochondrial dysfunction and cell death, which can in turn impair cardiac, liver, and kidney function. The figure was created with BioRender.com. This study identified significantly expressed genes in the mitochondria-, heart-, kidney-, and renal-related toxicity lists from Tox analysis for machine learning to validate their association with COVID-19 and conclude that the mitochondrial infection caused by COVID-19 further impairs cardiac, hepatic, and renal function. We also obtained evidence for the correlation between genes, terms, and pathways in the mitochondria, heart, liver, and kidneys during COVID-19 that have been demonstrated in other studies. The aim of this study was to obtain the same conclusion using different methods that extended the inquiry and provided further interpretation for these findings. Although there are many possible mechanisms by which SARS-CoV-2 causes multiorgan damage, including direct cellular invasion via ACE2 and cytokine storms, which in turn impair cardiac, hepatic, and renal function, the hypothesis of this study, the correlation between the mitochondrial impact of COVID-19 and further cardiac, renal, and hepatic impairment, which was tested using machine learning, holds true. By carrying out preliminary analysis through transcriptomics analysis, using machine learning to validate the conclusions of the analysis results, and cross-comparing reports, this analysis process identified and validated the hypotheses. However, importantly, the characteristics of the sample data used, such as the different tissues sampled and the different detection platforms and sample sizes, may affect the validation results. The trained machine learning models from [266]GSE152075 with the top 40 significant genes were tested by other sample sets, including [267]GSE163151, [268]GSE157103, and GSE1526414. The result of prediction accuracy is notably poor, which implies that if the tissues, gene expression detection platforms, or parameter settings are different, the trained machine learning model cannot be utilized in other sample sets (S4 Table in [269]S1 File). From this study, it can be seen that the results of toxicity lists varied between the different tissues sampled, and the results from the same tissues could also vary depending on the sampling platform and the ratio or size of the experimental and control samples. Further interpretation and the use of machine learning to analyze the effects of different sample tissues and different sample sizes on the hypothesis may be a valuable field of interest for subsequent study. Supporting information S1 File (PDF) [270]Click here for additional data file.^ (598.3KB, pdf) Data Availability All RNA-seq files are downloaded from the NCBI Gene Expression Omnibus (GEO) database (accession numbers GSE152075, and GSE163151, GSE157103, GSE169241, GSE152641). Funding Statement This work was supported by the Center for Advanced Computing and Imaging in Biomedicine (NTU-112L900701) from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan. References