Abstract
SARS-CoV-2, the virus that causes COVID-19, is a current concern for people worldwide. The virus has recently spread worldwide and is out of control in several countries, putting the outbreak into a terrifying phase. Machine learning combined with transcriptome analysis has advanced in recent years, and its strong performance in several fields makes it a potential option for finding out how SARS-CoV-2 is related to other diseases. Idiopathic pulmonary fibrosis (IPF), a disease caused by long-term lung injury, is a risk factor for SARS-CoV-2. In this article, we used a variety of combinatorial statistical approaches, machine learning, and bioinformatics tools to investigate how SARS-CoV-2 affects IPF patients’ complexity. For this study, we employed two RNA-seq datasets. The unique contributions include the identification of common genes to find shared pathways and drug targets, a protein–protein interaction (PPI) network to identify hub-genes and basic modules, and the interactions of transcription factor (TF) genes and TF–miRNAs with the common differentially expressed genes in the datasets. Furthermore, we used gene ontology and molecular pathway analysis for functional analysis and discovered that IPF patients have certain common connections with the SARS-CoV-2 virus. A detailed investigation was carried out to recommend therapeutic compounds for IPF patients affected by the SARS-CoV-2 virus.

Keywords: SARS-CoV-2, COVID-19, machine learning, idiopathic pulmonary fibrosis, gene ontology, differentially expressed genes

Introduction
Coronaviruses have various variants that can infect humans and animals [1]. These variants are responsible for diseases ranging from common fever, colds, and coughs to more serious illnesses such as Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS) [2].
Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) is a new type of coronavirus that attracted wide attention at the end of 2019 because it had never been observed in humans previously. Coronavirus Disease 2019 (COVID-19) is the disease caused by this new coronavirus, which was first discovered in Wuhan, China, in December 2019 [3]. Chinese officials reported 44 cases of pneumonia of unknown cause to the World Health Organization (WHO) between 31 December 2019 and 3 January 2020 [4]. The first COVID-19 fatality occurred in Wuhan on 9 January 2020, and the first death outside China occurred in the Philippines on 1 February 2020. Within a few days, the disease had spread worldwide and was out of control in many nations [4]. On 30 January 2020, the WHO declared the outbreak a Public Health Emergency of International Concern [5]. The same organization declared it a pandemic on 11 March 2020, after a total of 4500 deaths had been reported in 30 countries and territories [5]. On 19 March 2020, Italy surpassed China as the country with the most reported deaths from this virus [4], and on 26 March 2020 the USA surpassed both China and Italy as the country with the most confirmed cases [4]. Globally, the deadliest week was 13–19 April 2020, when nearly 7460 deaths per day were officially reported. The pandemic’s epicenter shifted to Latin America and the Caribbean in June 2020; between 15 July 2020 and 15 August 2020, the region averaged almost 2500 deaths per day. With over 78 000 cases on 30 August 2020, India surpassed the US record for the most cases in a single day, and a second wave hit India on 9 April 2021. There were 281 808 270 confirmed cases from December 2020 to December 2021, with 5 411 75 deaths from this virus [6].
On 26 November 2021, WHO designated a new SARS-CoV-2 variant (B.1.1.529), first reported from South Africa, as a variant of concern and named it Omicron. The first COVID-19 case associated with the Omicron variant in the USA was reported on 1 December 2021, and at least one Omicron case had been detected in 22 states as of 8 December 2021. Recently, this new variant has spread worldwide and is out of control in several countries, putting the outbreak into a terrifying phase. SARS-CoV-2 is a positive-sense single-stranded RNA virus. The spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins are its four structural proteins. Spike proteins are responsible for attaching to a host cell’s membrane.

Idiopathic pulmonary fibrosis (IPF) is a chronic illness marked by the thickening and stiffening of lung tissue associated with scar tissue formation [7]. In this condition, the spongy section of the lung becomes scarred or fibrotic. It is a slow-progressing, highly fatal disease that affects roughly 80% of people within 3–5 years of diagnosis [7]. Pulmonary fibrosis affects people in different ways, and various common, easily curable diseases can cause similar symptoms. Shortness of breath and a persistent dry, hacking cough are the most common signs and symptoms of IPF. Many affected people also notice a decrease in appetite and weight loss over time. Due to a lack of oxygen, some people with IPF develop enlarged, rounded tips on their fingers and toes (clubbing) [8]. The cause of IPF is not understood. The following are some of the most common risk factors for IPF:
* Age: almost all patients with IPF are over 50 years old.
* Genetics: up to 20% of patients with IPF have another family member who suffers from the condition.
* Smoking: approximately 75% of people with IPF smoke now or have smoked in the past.
* Reflux: gastroesophageal reflux or heartburn affects about 75% of people with IPF.
* Sex: male patients account for roughly 65% of IPF patients [9].
* Treatment exposure: radiation treatments to the chest or the use of certain chemotherapy medications have been shown to increase the risk of pulmonary fibrosis [10, 11].

The SARS-CoV-2 spike protein interacts strongly with ACE2, and IPF patients express high levels of this enzyme, which supports IPF as a risk factor for COVID-19 [12, 13]. These investigations have revealed several links between IPF and COVID-19, which raises concerns.

Contributions
In this article, we used a variety of combinatorial statistical approaches, machine learning algorithms, and bioinformatics tools to investigate how the SARS-CoV-2 virus affects IPF patients’ complexity. The following are the main contributions of this article:
* the experiments were conducted using real-world datasets. We identified common genes using machine learning algorithms and various bioinformatics analyses in order to find shared pathways and drug targets;
* the Protein–Protein Interaction (PPI) network was examined to discover hub-genes and modules. The interactions of transcription factor (TF) genes and TF–miRNAs with the common differentially expressed genes (DEGs) were also identified. Furthermore, we used gene ontology (GO) and molecular pathway analyses for functional analysis and discovered that IPF patients have certain common connections with SARS-CoV-2 infection;
* a comprehensive analysis was conducted to suggest drug molecules for IPF patients with SARS-CoV-2 infection. It draws on molecular knowledge and several pathway-based analyses, which illustrate the utility of the biological system for both SARS-CoV-2 and IPF; and
* finally, the current challenges and future research directions for the integration and interplay of machine learning and bioinformatics are discussed.

The remainder of this study is organized in the following manner.
The ‘Materials and methods’ section begins with a full description of the datasets and their preprocessing, followed by an overview of the selected methodology. The ‘Result analysis’ section discusses the evaluation and interpretation of the experimental outcomes. The ‘Discussion’ section then gives a detailed explanation and discusses some application areas for the scientific community. Finally, the ‘Conclusions’ section summarizes the findings and possible future directions.

Materials and methods
In this section, we describe the analysis in detail, including the dataset transformation process and the various transcriptome analyses.

Overview of approach
We applied machine learning and transcriptomic analysis to identify shared associations between SARS-CoV-2 and IPF using the selected datasets, as shown in the block diagram in Fig. 1. Machine learning approaches were used to identify the common DEGs of the selected datasets. These common DEGs were then used to construct gene–disease association networks; to identify GO terms, pathways, the PPI network, hub-genes, and transcription factor (TF)–gene and TF–miRNA interactions; and to identify candidate drugs.

Figure 1: The complete workflow for the current investigation. Two types of samples (control cells and affected cells) were collected from SARS-CoV-2-infected lung epithelial cells, and both are included in the GSE147507 dataset. The GSE52463 dataset contains IPF-affected lung samples. Common DEGs were identified from both datasets using a machine learning technique. From the common DEGs, GO identification, pathway analysis, PPI network construction, TF–gene analysis, TF–miRNA analysis, and hub-gene identification were performed, and drug molecules were identified based on those analyses.

Dataset analysis
In this section, we perform a series of operations on the datasets without changing their properties.
We also give an overview of the selected datasets.

Dataset description
We identified common genetic interrelationships between SARS-CoV-2 and IPF using Ribonucleic Acid Sequencing (RNA-Seq) datasets from the Gene Expression Omnibus (GEO) collection of the National Center for Biotechnology Information (NCBI) [14, 15]. The SARS-CoV-2 dataset, with GEO accession ID GSE147507 and GEO platform ID GPL18573, contains the transcriptional responses to SARS-CoV-2 infection. The IPF dataset, with GEO accession ID GSE52463 and GEO platform ID GPL11154, comes from a transcriptome analysis that reveals differential splicing events in IPF lung tissue [16]. SARS-CoV-2-affected lung epithelial cells (LECs) are found in GSE147507, while IPF-affected lung tissues are found in GSE52463. The GSE147507 dataset contains two types of samples (control and SARS-CoV-2-affected cells) taken from LECs, while the GSE52463 dataset has two types of samples (control and IPF-affected cells). Metadata and count data are included in both datasets. The RNA sequences in the GSE147507 dataset were obtained using high-throughput sequencing on the Illumina NextSeq 500 (Homo sapiens) platform [17]. The IPF dataset, on the other hand, comprises mRNA sequencing of eight IPF-affected lung tissue samples and seven control lung tissue samples, all of which were sequenced on the Illumina HiSeq 2000 (Homo sapiens) platform using high-throughput sequencing technology [18]. Table 1 lists the datasets used in this study together with their GEO features and sequencing methods.
Table 1: Contents of the datasets

| Properties | SARS-CoV-2 | IPF |
|---|---|---|
| GEO Accession | GSE147507 | GSE52463 |
| GEO Platform | GPL18573 | GPL11154 |
| Organisms | Homo sapiens | Homo sapiens |
| Assay type | RNA-Seq | RNA-Seq |
| Type of the datasets | Transcriptional response to SARS-CoV-2 infection | In IPF lung tissue, transcriptome analysis indicates distinct splicing events. |
| Instrument | Illumina NextSeq 500 | Illumina HiSeq 2000 |
| Total GEO samples | 110 | 15 |
| Experiment type | High-throughput sequencing for expression profiling | High-throughput sequencing for expression profiling |

Data preparation
To achieve optimal performance, it is necessary to clean and prepare the dataset before applying machine learning methods. Data preparation is generally done by removing unnecessary features, checking the variation of independent features, converting non-numerical features, removing outliers, and replacing missing values if they exist. Two fundamental steps apply during the data preparation process: data preprocessing and data transformation.

Data preprocessing
The data originate from multiple heterogeneous sources and, because of their size, are highly susceptible to missing and noisy values. This section discusses the essential steps in data preprocessing: data cleaning and data integration.
* Data Cleaning: First, we applied various techniques to remove noise and clean inconsistencies in the metadata and count data of both datasets, for example Rosner’s test for outlier detection and the predictive mean matching method for imputing missing values. Then, to apply machine learning techniques, we converted the qualitative values into quantitative values using several R packages from Bioconductor (e.g. Biobase (version 2.30.0), GEOquery (version 2.40.0), and limma (version 3.26.8)). Bioconductor is a free, open-source, and open-development software project for the analysis and comprehension of genomic data.
* Data Integration: To improve accuracy, the data integration step helped us reduce and avoid redundancies in the resulting dataset. Because the data originate from multiple sources, it is essential to check both datasets for redundancy using correlation analysis, which measures how strongly one feature implies the other. Figure 2a and b shows the correlation between different features for the two datasets, GSE147507 and GSE52463, respectively. We evaluated the correlation between all the features using Pearson’s product-moment coefficient:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ is the mean of the x variable, $\bar{y}$ is the mean of the y variable, and $x_i$ and $y_i$ are the values in tuple i.

Figure 2: The correlation analysis between different features for the two datasets (a) GSE147507 and (b) GSE52463. This analysis measures how strongly one feature implies the other.

Data transformation
We applied this processing step to make the subsequent analysis more efficient and the patterns easier to understand. Some selected features have much larger values than others, which can distort the results. We therefore scaled the selected feature values to the range 0.0 to 1.0 without changing the characteristics of the data:

$$N = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

where N is the normalized output value, X is the original value, and $X_{\max}$ and $X_{\min}$ are the maximum and minimum values of the feature, respectively.
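As a small illustration of the two formulas in this part of the text, the following sketch computes Pearson's coefficient and the minimum–maximum scaling with the Python standard library only. The function names and the sample values are assumptions made for this example; they are not taken from the study's own code or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


def min_max_normalize(values):
    """Scale values to [0.0, 1.0] using N = (X - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map it to 0.0 by convention
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values, chosen only to exercise the functions
feature_a = [10.0, 20.0, 30.0, 40.0]
feature_b = [1.0, 2.1, 2.9, 4.2]
r = pearson_r(feature_a, feature_b)
scaled = min_max_normalize(feature_a)
```

Mapping a constant feature to 0.0 is a design choice of this sketch; the source does not say how such features were handled.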
The scaling equation for N is known as minimum–maximum normalization, and it was used to scale the selected feature values to this range. We also evaluated density plots for both datasets. A density plot shows the smoothed distribution of the points along the numeric axis, with peaks at the locations where the concentration of points is highest. Figure 3a and b shows density plots for the two datasets, GSE147507 and GSE52463, respectively.

Figure 3: The density plots of the two datasets (a) GSE147507 and (b) GSE52463.

Spotting DEGs and shared DEGs between SARS-CoV-2 and IPF
A gene is considered differentially expressed when its transcription level differs by a statistically significant amount between test conditions [19]. The major purpose of this study is to identify DEGs that are shared between the GSE147507 and GSE52463 datasets. The DESeq2 and limma packages of the R programming language were used to analyze the expression data. DEGs from both datasets were identified using a machine learning method. Listing 1 shows the procedure applied to identify DEGs from both datasets. Across all datasets, significant DEGs were identified using the cutoff criteria P-value < 0.05 and |log FC| ≥ 1.00. The shared DEGs of the GSE147507 and GSE52463 datasets were found using the online Venn analysis tool jvenn [20].

Listing 1. The procedure of the machine learning algorithm to identify DEGs from both datasets.
    # Input:  RNA-Seq count data and metadata for each dataset
    # Output: differentially expressed genes (DEGs), appended to result.csv
    library(DESeq2)
    library(dplyr)

    outputFileName <- "result.csv"
    datasetIds <- c("147507", "52463")  # GEO accessions GSE147507 and GSE52463

    for (id in datasetIds) {
      # Extract count data; with tidy = TRUE the first column must hold gene IDs
      countData <- read.csv(paste0("GSE", id, "filtered_countdata.csv"))
      countDataFrame <- data.frame(countData)
      # Round numeric columns to three decimals (DESeq2 itself expects integer counts)
      countDataFrameRound <- mutate(countDataFrame,
                                    across(where(is.numeric), ~ round(.x, 3)))

      # Extract metadata; it must contain the 'treatment' column used in the design
      metaData <- read.csv(paste0("GSE", id, "filtered_metadata.csv"))
      metaDataFrame <- data.frame(metaData)

      # Analyze count data using DESeq2
      applyDESeq <- DESeqDataSetFromMatrix(countData = countDataFrameRound,
                                           colData = metaDataFrame,
                                           design = ~treatment, tidy = TRUE)
      applyDESeq <- DESeq(applyDESeq)
      result <- results(applyDESeq, tidy = TRUE)  # data frame, gene ID in column 'row'

      # Result analysis: check for and omit rows with missing values
      checkNull <- is.na(result)
      resultOmitNa <- na.omit(result)

      # Upregulated genes
      resultOmitNaFilterUp <- filter(resultOmitNa, log2FoldChange > 1 & padj < 0.05)
      # Downregulated genes
      resultOmitNaFilterDown <- filter(resultOmitNa, log2FoldChange < -1 & padj < 0.05)
      # Absolute log fold change with the cutoff on the adjusted P-value
      resultFinal <- filter(resultOmitNa, abs(log2FoldChange) > 1 & padj < 0.05)

      # Append to the output file, writing the header only once
      write.table(resultFinal, outputFileName, sep = ",", row.names = FALSE,
                  append = file.exists(outputFileName),
                  col.names = !file.exists(outputFileName))
    }

Identification of GO and molecular pathways
Gene set enrichment analysis is a technique for identifying DEGs linked to a biological process or molecular function [21]. GO is a classification system that divides genes into biological processes, molecular functions, and cellular components [22].
The purpose of analyzing GO terms is to understand the molecular activity, the cellular structure, and the position in the cell where genes fulfill their functions [23]. We used four databases to find common molecular pathways in IPF and COVID-19: Kyoto Encyclopedia of Genes and Genomes (KEGG) [24], WikiPathways [24], Reactome [25], and BioCarta [25]. KEGG provides a variety of gene annotations and is commonly used to characterize metabolic pathways. The web-based platform Enrichr was used to obtain GO terms and molecular pathways for the common genes mentioned earlier in this research [26, 27]. To derive GO terms and molecular pathways, we used the 20 selected common genes.

Analysis of PPI network
The role of PPIs in cellular biology is expected to be a major focus of research, and it is a prerequisite for systems biology [28]. Proteins carry out their activities inside cells by interacting with other proteins, and the associations captured in a PPI network provide information about each protein’s function [29]. We built the PPI network of the proteins encoded by the DEGs using the STRING resource to examine the functional and physical associations shared by IPF and COVID-19 [30]. STRING generates experimental and predicted interactions that are characterized by 3D structures, accessory data, and confidence scores [31]. STRING groups confidence scores into categories (low, medium, and high), and we worked with a medium confidence score (0.400). We used the network type “full STRING network” (the edges indicate both functional and physical protein associations) and set the number of interactors to 10.
Then, we imported our PPI network into Cytoscape (version 3.7.1) for visual representation and further analysis. To identify hub-genes, the obtained PPIs were analyzed in Cytoscape. Cytoscape is an open-source network visualization framework that serves as a versatile means of combining several datasets for various interaction types, such as protein–protein interactions, genetic interactions, and protein–DNA interactions, among others [32, 33].

Identification of hub-genes and module analysis
A PPI network consists of nodes and the edges connecting them, and hub-genes are its most highly connected nodes. Hub-genes mark dense areas that are regarded as important parts of the PPI network. The hub-genes of the PPI network were identified with CytoHubba, a Cytoscape plugin [34]. CytoHubba is a popular Cytoscape plugin for hub-gene identification, in part because of its user-friendly interface. CytoHubba has 20 different methods for topological analysis (e.g. MCC, Degree, DMNC, MNC, EPC, Bottleneck, etc.). The degree method was employed to find the hub-genes for this study, in preference to the other approaches, because it facilitates analysis by suggesting large, closely compacted modules in the PPI network [35]. The Molecular Complex Detection (MCODE) plugin in Cytoscape was used to locate the most significant modules in the PPI network [36]. The MCODE method is based on a graph-theoretic clustering algorithm that detects densely connected regions in large protein–protein interaction networks that may represent molecular complexes [36].
The method has the advantage over other graph clustering methods of having a directed mode that allows fine-tuning of clusters of interest without considering the rest of the network and allows examination of cluster inter-connectivity, which is relevant for protein networks. Furthermore, the method is not affected by the known high rate of false positives in data from high-throughput interaction techniques [37]. Moreover, the method is relatively easy to implement and, since it is local density based, has the advantages of both a directed mode and a complex connectivity mode. The MCODE method was therefore also applied to the PPI network to locate highly connected areas that may correspond to molecular complexes.

TF–gene analysis
TFs bind to specific genes and regulate their levels of expression, which makes them essential for molecular recognition [38]. In all species, TFs control gene expression and play a critical role in transcription. TFs are also important in a variety of biological processes, including cell cycle regulation and development. TF–gene interactions with the top 12 of the 90 common DEGs were used to investigate the effects of TFs on functional pathways and gene expression levels. We used the NetworkAnalyst tool to find topologically relevant TFs from the ENCODE database, on which the TF–gene interaction network was based [39–41], and in this way obtained the TF–gene interactions for the previously established common genes. NetworkAnalyst is a web-based tool for transcriptional research and meta-analysis on various species, including humans [42, 43]. The TF–gene interaction network is made up of 190 nodes and 301 edges. The network has 12 DEGs and 178 TF–genes, where HSPB6 is regulated by 85 TF–genes, EPAS1 is regulated by 68 TF–genes, and FCGR2A is regulated by 37 TF–genes according to their degree value.
These 178 TF–genes regulate more than one common DEG, which indicates a high degree of interaction between the TF–genes and the common DEGs.

TF–miRNA interaction with the common DEGs
The miRNAs are short non-coding RNAs that are transcribed by RNA polymerase II and then processed step by step through a shared biogenesis pathway. Using a combination of experimental and computational techniques, miRNAs have been discovered in a variety of species. By binding to the 3′-untranslated region of target transcripts, miRNAs regulate gene expression at the post-transcriptional stage. The RegNetwork database was used to collect TF–miRNA coregulatory interactions, which help to identify the miRNAs and regulatory TF–genes that regulate the DEGs of interest at the transcriptional and post-transcriptional stages [43]. We found miRNAs that interact with the common DEGs and then used the NetworkAnalyst tool to analyze how they interact. With this platform, researchers can explore complex datasets and determine biological traits and functions [44]. The network of miRNA–gene interactions was examined using Cytoscape. By ranking the top miRNAs, this software helps researchers determine biological roles and features. The TF–miRNA coregulatory network has 191 nodes and 216 edges. In this network, the DEGs interact with 87 miRNAs and 93 TF–genes.

Candidate drugs identification
Predicting protein–drug interactions (PDI), or recognizing drug molecules, is important for this research. We identified therapeutic molecules based on the common DEGs of SARS-CoV-2 and IPF using the Enrichr tool and the Drug Signatures Database (DSigDB), which contains 22 527 gene sets. The Enrichr platform was used to access DSigDB [45, 46]. Enrichr is a well-known web portal with many gene-set libraries that can be used to investigate gene-set enrichment on a genome-wide scale [26].

Result analysis
The overall outcome of the analysis is discussed in this section.
It begins with the identification of DEGs and mutual DEGs and proceeds to a description of the candidate drug identification procedure.

DEGs and mutual DEGs identification
We investigated the interrelationships and implications of the disrupted genes that are activated in COVID-19 and IPF using human RNA-seq datasets from NCBI. DEGs for SARS-CoV-2 were determined from the GSE147507 dataset, whose GEO platform identifier is GPL18573. There are 926 upregulated and 799 downregulated genes in the GSE147507 dataset, giving 1725 DEGs. In the GSE52463 dataset, which has the GEO platform identifier GPL11154, we found a total of 1008 DEGs, with 669 upregulated and 339 downregulated genes. The quantitative measurements of the selected datasets are shown in Table 2. After cross-comparative analysis using jvenn, a trustworthy web platform for Venn analysis, we found 90 common DEGs in the GSE147507 and GSE52463 datasets. Twenty of these 90 common DEGs were chosen for further study based on the P-value (MDK, HP, HSPB6, CHIT1, TNFAIP6, EPAS1, MMP1, CCL18, CXCL6, CCL11, IL1RN, LAMP3, CD207, ARRB1, RNASE2, LILRA1, FCGR2A, STAT4, CD69, and SAMSN1), and the subsequent analyses were conducted using these 20 common DEGs. Figure 4 depicts the common DEGs as a Venn diagram, with 90 genes shared between the GSE147507 and GSE52463 datasets.

Table 2: Quantitative measurements of the datasets used in this analysis

| Properties | GSE147507 | GSE52463 |
|---|---|---|
| Common gene analysis | DESeq2 and the limma package | DESeq2 and the limma package |
| Cutoff criteria | P < 0.05 and \|log FC\| ≥ 1.0 | P < 0.05 and \|log FC\| ≥ 1.0 |
| Total DEGs count | 1725 genes | 1008 genes |
| Upregulated DEGs count | 926 genes | 669 genes |
| Downregulated DEGs count | 799 genes | 339 genes |

Figure 4: Common DEGs representation through a Venn diagram.
Ninety genes were common to the two datasets, leaving 1635 DEGs specific to SARS-CoV-2 infection and 918 DEGs specific to IPF patients. The common DEGs make up 3.4% of all 2643 distinct DEGs (2553 dataset-specific plus 90 shared).

GO and molecular pathway analysis
Gene set enrichment analysis is a technique for identifying DEGs linked to a biological process or molecular function. For this study, we looked at the selected common DEGs. GO terms are divided into biological processes, cellular components, and molecular functions. Table 3 shows the biological process-related GO terms identified based on the combined score, Table 4 shows the molecular function-related GO terms, and Table 5 shows the cellular component-related GO terms. KEGG, WikiPathways, Reactome, and BioCarta were used to find the most impactful pathways of the shared DEGs between IPF and SARS-CoV-2. Tables 6, 7, 8, and 9 show the essential pathways discovered in the datasets. Graphical views of the GO term and pathway analyses are shown in Figs. 5 and 6.
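The cutoff and overlap steps behind the Venn diagram of Figure 4 can be sketched with plain set operations. The snippet below is a minimal illustration and not the authors' pipeline: `select_degs` and `venn_summary` are hypothetical helper names, the gene rows are toy placeholders, and the threshold follows the criteria stated in the text (P < 0.05 and |log FC| ≥ 1.0). Only the totals passed to `venn_summary` (1725, 1008, and 90) come from the article.

```python
def select_degs(rows, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return the set of gene names passing the fold-change and P-value cutoffs.

    Each row is a (gene, log_fold_change, p_value) tuple.
    """
    return {
        gene
        for gene, lfc, p_value in rows
        if abs(lfc) >= lfc_cutoff and p_value < p_cutoff
    }


def venn_summary(n_a, n_b, n_common):
    """Counts for a two-set Venn diagram and the shared percentage."""
    only_a = n_a - n_common
    only_b = n_b - n_common
    distinct = only_a + only_b + n_common
    return only_a, only_b, distinct, 100.0 * n_common / distinct


# Toy rows with placeholder gene names (not real results)
covid_rows = [("GENE_A", 2.3, 0.001), ("GENE_B", -1.4, 0.020), ("GENE_C", 0.4, 0.001)]
ipf_rows = [("GENE_A", 1.1, 0.030), ("GENE_B", -0.2, 0.010), ("GENE_D", 3.0, 0.200)]
common = select_degs(covid_rows) & select_degs(ipf_rows)

# Totals reported in the article: 1725 and 1008 DEGs, 90 of them shared
only_covid, only_ipf, distinct, pct_common = venn_summary(1725, 1008, 90)
```

With the reported totals, the helper gives 1635 and 918 dataset-specific DEGs and a shared fraction that rounds to 3.4% of the 2643 distinct DEGs.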
Table 3: The combined score was used to identify biological process-related GO terms (group: GO biological process)

| GO ID | GO term | P-value | Genes |
|---|---|---|---|
| GO:0006032 | Chitin catabolic process | 6.98E-03 | CHIT1 |
| GO:0090240 | Positive regulation of histone H4 acetylation | 6.98E-03 | ARRB1 |
| GO:0006030 | Chitin metabolic process | 6.98E-03 | CHIT1 |
| GO:0072677 | Eosinophil migration | 2.59E-04 | CCL11; CCL18 |
| GO:0048245 | Eosinophil chemotaxis | 2.59E-04 | CCL11; CCL18 |
| GO:0070098 | Chemokine-mediated signaling pathway | 1.83E-05 | CXCL6; CCL11; CCL18 |
| GO:0030593 | Neutrophil chemotaxis | 1.94E-05 | CXCL6; CCL11; CCL18 |
| GO:0002029 | Desensitization of G-protein coupled receptor protein signal | 7.97E-03 | ARRB1 |
| GO:0038114 | Interleukin-21-mediated signaling pathway | 7.97E-03 | STAT4 |
| GO:0098757 | Cellular response to interleukin-21 | 7.97E-03 | STAT4 |

Table 4: The combined score was used to identify GO terms linked to molecular functions (group: GO molecular function)

| GO ID | GO term | P-value | Genes |
|---|---|---|---|
| GO:0019966 | Interleukin-1 binding | 5.98E-03 | IL1RN |
| GO:0008009 | Chemokine activity | 1.26E-05 | CXCL6; CCL11; CCL18 |
| GO:0004568 | Chitinase activity | 6.98E-03 | CHIT1 |
| GO:0042379 | Chemokine receptor binding | 1.53E-05 | CXCL6; CCL11; CCL18 |
| GO:0005537 | Mannose binding | 1.09E-02 | CD207 |
| GO:0048020 | CCR chemokine receptor bind | 6.54E-04 | CCL11; CCL18 |
| GO:0005041 | Low-density lipoprotein receptor | 1.29E-02 | TNFAIP6 |
| GO:0005125 | Cytokine activity | 1.53E-05 | CXCL6; IL1RN; CCL11; |
| GO:0005149 | Interleukin-1 receptor binding | 1.49E-02 | IL1RN |
| GO:0005159 | Binding of insulin-like growth factor receptors | 1.49E-02 | ARRB1 |

Table 5: The combined score was used to identify cellular component-related GO terms (group: GO cellular component)

| GO ID | GO term | P-value | Genes |
|---|---|---|---|
| GO:1904724 | Tertiary granule lumen | 1.37E-03 | CHIT1; TNFAIP6 |
| GO:0030669 | Clathrin-coated endocytic vesicle membrane | 3.25E-02 | CD207 |
| GO:0045334 | Clathrin-coated endocytic vesicle | 4.88E-02 | CD207 |
| GO:0070820 | Tertiary granule | 1.15E-02 | CHIT1; TNFAIP6 |
| GO:0030659 | Cytoplasmic vesicle membrane | 5.27E-02 | ARRB1 |
| GO:0035580 | Specific granule lumen | 6.02E-02 | CHIT1 |
| GO:0031410 | Cytoplasmic vesicle | 1.92E-02 | CD207; ARRB1 |
| GO:0005769 | Early endosome | 2.04E-02 | LAMP3; CD207 |
| GO:0031901 | Early endosome membrane | 7.05E-02 | CD207 |
| GO:0030665 | Clathrin-coated vesicle membrane | 7.79E-02 | CD207 |

Table 6: Pathways identified through KEGG using the combined score

| Pathway | P-value | Genes |
|---|---|---|
| IL-17 signaling pathway | 1.05E-04 | CXCL6; CCL11; MMP1 |
| Chemokine signaling pathway | 3.39E-05 | CXCL6; CCL11; ARRB1; CCL18 |
| Cytokine–cytokine receptor interaction | 1.84E-04 | CXCL6; IL1RN; CCL11; CCL18 |
| Rheumatoid arthritis | 3.69E-03 | CXCL6; MMP1 |
| Asthma | 3.06E-02 | CCL11 |
| Osteoclast differentiation | 7.05E-03 | FCGR2A; LILRA1 |
| Relaxin signaling pathway | 7.37E-03 | MMP1; ARRB1 |
| Bladder cancer | 4.02E-02 | MMP1 |
| Hedgehog signaling pathway | 4.59E-02 | ARRB1 |
| Amino sugar and nucleotide sugar metabolism | 4.69E-02 | CHIT1 |

Figure 5: GO terms identified according to the combined score for (a) biological process, (b) molecular function, and (c) cellular component. The higher the enrichment score, the more genes are involved in a given ontology.

Figure 6: Pathway analysis results identified using (a) KEGG, (b) WikiPathways, (c) Reactome, and (d) BioCarta. The pathway terms were identified through the combined score.
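To make the P-values in the enrichment tables more concrete, the sketch below computes a one-sided over-representation P-value from the hypergeometric distribution, which is equivalent to a one-sided Fisher exact test. This is a generic illustration under stated assumptions, not a reproduction of Enrichr's scoring: the function name and every count passed to it are hypothetical, and the combined score used to rank terms in this study involves further steps that are not reproduced here.

```python
from math import comb


def enrichment_p_value(n_background, n_in_term, n_query, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_background, n_in_term, n_query).

    n_background: genes in the background universe
    n_in_term:    background genes annotated with the term or pathway
    n_query:      genes in the query list (for example, the common DEGs)
    n_overlap:    query genes annotated with the term
    """
    upper = min(n_in_term, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 query genes, a 60-gene term, 20 000 background genes,
# and 3 query genes falling in the term
p_toy = enrichment_p_value(n_background=20000, n_in_term=60, n_query=20, n_overlap=3)
```

A small P-value here means that the observed overlap is unlikely under random sampling of the query genes from the background, which is the sense in which a term is called enriched.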
Table 7: Pathway analysis results identified through Wiki Pathways using the combined score

| Database | Pathway | P-value | Genes |
|---|---|---|---|
| Wiki Pathways | Thymic Stromal Lymphopoietin Signaling Pathway | 1.00E-03 | CCL11; STAT4 |
| | Amplification and Expansion of Oncogenic Pathways as Metastatic Traits | 1.69E-02 | EPAS1 |
| | Matrix Metalloproteinases | 2.95E-02 | MMP1 |
| | Signal transduction through IL1R | 3.25E-02 | IL1RN |
| | Type 2 papillary renal cell carcinoma | 3.34E-02 | EPAS1 |
| | Photodynamic therapy-induced NF-kB survival signaling | 3.44E-02 | MMP1 |
| | Bladder Cancer | 3.92E-02 | MMP1 |
| | Integrated Cancer Pathway | 4.31E-02 | MMP1 |
| | Hedgehog Signaling Pathway | 4.31E-02 | ARRB1 |
| | Hepatitis C and Hepatocellular Carcinoma | 4.79E-02 | MMP1 |

Table 8: Pathway analysis results identified through Reactome using the combined score

| Database | Pathway | P-value | Genes |
|---|---|---|---|
| Reactome | PTK6 Expression | 4.99E-03 | EPAS1 |
| | Regulation of gene expression by Hypoxia-inducible Factor | 9.96E-03 | EPAS1 |
| | Chemokine receptors bind chemokines | 1.42E-03 | CXCL6; CCL11 |
| | Oxygen-dependent proline hydroxylation of Hypoxia-inducible Factor Alpha | 1.78E-02 | EPAS1 |
| | Activation of SMO | 1.78E-02 | ARRB1 |
| | Regulation of Insulin-like Growth Factor transport and uptake by Insulin-like Growth Factor Binding Proteins | 2.08E-02 | MMP1 |
| | NOTCH2 Activation and Transmission of Signal to the Nucleus | 2.08E-02 | MDK |
| | Basigin interactions | 2.47E-02 | MMP1 |
| | Regulation of hypoxia-inducible Factor by oxygen | 2.56E-02 | EPAS1 |
| | Cellular response to hypoxia | 2.57E-02 | EPAS1 |

Table 9: Pathway analysis results identified through BioCarta using the combined score

| Database | Pathway | P-value | Genes |
|---|---|---|---|
| BioCarta | Beta-arrestins in GPCR Desensitization Pathway | 3.54E-04 | CCL11; ARRB1 |
| | NO2-dependent IL12 Pathway in NK cells Pathway | 8.96E-03 | STAT4 |
| | Role of Beta-arrestins in the activation and targeting of MAP kinases Pathway | 4.06E-04 | CCL11; ARRB1 |
| | G-Protein Signaling Through Tubby Proteins Pathway | 9.95E-03 | CCL11 |
| | Roles of Beta-arrestins-dependent Recruitment of Src Kinases in GPCR Signaling Pathway | 5.23E-04 | CCL11; ARRB1 |
| | Activation of PKC through G-protein coupled receptors Pathway | 1.09E-02 | CCL11 |
| | Visual Signal Transduction Pathway | 1.29E-02 | ARRB1 |
| | Attenuation of GPCR Signaling Pathway | 1.29E-02 | ARRB1 |
| | IL12- and Stat4-dependent Signaling Pathway in Th1 Development | 1.49E-02 | STAT4 |
| | Cystic fibrosis transmembrane conductance regulator (CFTR) and beta 2 adrenergic receptor (b2AR) | 1.98E-02 | CCL11 |

Analysis of PPI network for the identification of hub-genes

The PPI network is the central element of this study: hub-gene recognition, module analysis, and drug identification were all carried out on it. The common DEGs were provided as input to STRING, and the resulting analysis file was imported into Cytoscape for visualization, producing a PPI network for the common DEGs. Because the PPI results feed directly into the therapeutic compound suggestions, the PPI analysis is the focus of the research. Figure 7 shows the PPI network with 60 nodes and 308 edges. The PPI network was built for SARS-CoV-2 and IPF to discover hub-genes and medicinal compounds.

Figure 7: PPI network of the DEGs common to the two illnesses (SARS-CoV-2 and IPF). The orange nodes denote common DEGs, and the edges denote the relationship between two genes. The network has 60 nodes and 308 edges.

Identification of hub-genes for therapeutic solutions and module analysis

CytoHubba, a Cytoscape plugin, was used to extract the hub-genes from the PPI network. The genes were ranked by degree, which is the number of interactions a gene has in the PPI network. Hub-genes are the most highly connected nodes in a PPI network. The topological analysis identified the top five genes (AKT1, IL1B, CCL5, MMP9, and ARRB1), which were classified as hub-genes based on their degree value. Table 10 shows the results of the topological analysis.
These hub-genes could be exploited as biomarkers, leading to new therapeutic approaches for the studied diseases. The hub-gene network has 50 nodes and 283 edges and is laid out as a degree-sorted circle. It is depicted in Fig. 8, with the top five hub-genes AKT1, IL1B, CCL5, MMP9, and ARRB1.

Table 10: Topological results for the top five hub-genes

| Hub gene | Degree | Stress | Closeness | Betweenness | Bottleneck | Clustering coefficient | Eccentricity | Radiality |
|---|---|---|---|---|---|---|---|---|
| AKT1 | 27 | 3322 | 42.25000 | 637.30186 | 26 | 0.25356 | 0.25000 | 4.47458 |
| IL1B | 26 | 2172 | 42.33333 | 475.08574 | 03 | 0.34154 | 0.33333 | 4.52542 |
| CCL5 | 22 | 1216 | 38.25000 | 238.70899 | 14 | 0.35931 | 0.25000 | 4.23729 |
| MMP9 | 22 | 1808 | 39.16667 | 322.49125 | 07 | 0.35498 | 0.33333 | 4.33898 |
| ARRB1 | 19 | 1630 | 37.55000 | 291.37776 | 06 | 0.43865 | 0.25000 | 4.25424 |

Figure 8: Hub-genes identified from the PPI network. There are 50 nodes and 283 edges in the network. According to the topological analysis, AKT1 and IL1B have degrees of 27 and 26, respectively, and CCL5, MMP9, and ARRB1 have degrees of 22, 22, and 19, respectively.

TF–gene analysis

The NetworkAnalyst platform was used to investigate TF–gene interactions, with the common DEGs as input. The TF–gene network has 190 nodes and 301 edges. It contains 12 DEGs and 178 TF–genes; ranked by degree, 85 TF–genes regulate HSPB6, 68 regulate EPAS1, and 37 regulate FCGR2A. These 178 TF–genes interact with the common DEGs, indicating a high level of interaction between the TF–genes and the common DEGs. The TF–gene network is shown in Fig. 9.

Figure 9: Network of interactions between TF–genes and common DEGs. The common genes are shown as highlighted yellow nodes, and TF–genes are represented by the other nodes. There are 190 nodes and 301 edges in the network.
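The hub-gene and TF–gene rankings above both rest on node degree. As a minimal sketch of that idea (not the CytoHubba or NetworkAnalyst implementation), the Python snippet below counts distinct interaction partners per node in an undirected edge list and returns the highest-degree nodes. The function name `rank_by_degree` and the toy edge list are hypothetical; the edges are invented for illustration and are not taken from the study's STRING output.

```python
from collections import defaultdict


def rank_by_degree(edges, top_n=5):
    """Return the top_n (node, degree) pairs from an undirected edge list.

    Degree is the number of distinct partners; self-loops and duplicate
    edges are ignored. Ties are broken alphabetically by node name.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # skip self-interactions
        neighbours[a].add(b)
        neighbours[b].add(a)
    ranked = sorted(neighbours, key=lambda g: (-len(neighbours[g]), g))
    return [(g, len(neighbours[g])) for g in ranked[:top_n]]


# Invented toy network, for illustration only.
toy_edges = [
    ("AKT1", "IL1B"), ("AKT1", "MMP9"), ("AKT1", "CCL5"),
    ("IL1B", "MMP9"), ("IL1B", "CCL5"), ("ARRB1", "AKT1"),
]
for gene, degree in rank_by_degree(toy_edges, top_n=3):
    print(gene, degree)
```

Degree is only one of the topological measures reported in Table 10; measures such as betweenness or closeness would need a shortest-path computation and are not covered by this sketch.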
TF–miRNA analysis

The TF–miRNA coregulatory network was built using the NetworkAnalyst tool. Analysis of this network revealed the connections of miRNAs and TFs with the common DEGs. The coregulatory network has 191 nodes and 216 edges, in which the DEGs interact with 87 miRNAs and 93 TF–genes. Figure 10 shows the TF–miRNA coregulatory network.

Figure 10: The TF–miRNA network contains 93 TF–genes, 87 miRNAs, and 11 DEGs, giving 191 nodes and 216 edges. DEGs are represented by blue nodes, miRNAs by green nodes, and TF–genes by the other nodes.

Candidate drugs identification and validation

Drug compounds for the common DEGs were identified using the Enrichr platform. Using the DSigDB database, we identified 10 candidate medicinal compounds, which were extracted based on the P-value and adjusted P-value. According to the data, nickel sulfate CTD 00001417, clonidine HL60 UP, and thymolphthalein CTD 00006891 are the three drug compounds with which the most genes interact. Because these signature compounds were identified from the common DEGs, they are candidate compounds for both COVID-19 and IPF. Table 11 lists the top-ranked compounds for the common DEGs from the DSigDB database.
Table 11: The top 10 drug compounds suggested for common DEGs

| Name of the drug | P-value | Adjusted P-value | Genes |
|---|---|---|---|
| Nickel Sulfate CTD 00001417 | 1.37E-12 | 8.81E-10 | CXCL6; IL1RN; CCL11; TNFAIP6; EPAS1; MMP1; LAMP3; CD207; STAT4; CD69; SAMSN1 |
| Clonidine HL60 UP | 1.04E-06 | 3.36E-04 | IL1RN; FCGR2A; RNASE2; SAMSN1 |
| Thymolphthalein CTD 00006891 | 3.80E-04 | 1.01E-02 | EPAS1; ARRB1 |
| Peptidoglycan CTD 00006490 | 4.34E-04 | 1.07E-02 | TNFAIP6; MMP1 |
| Lithocholic acid HL60 UP | 4.63E-04 | 1.10E-02 | CD69; SAMSN1 |
| Beclomethasone CTD 00005468 | 3.93E-05 | 3.21E-03 | IL1RN; CCL11; RNASE2 |
| Salmeterol CTD 00002421 | 4.92E-04 | 1.13E-02 | CCL11; RNASE2 |
| Mephentermine HL60 UP | 4.48E-05 | 3.21E-03 | IL1RN; EPAS1; CD69 |
| Colchicine HL60 UP | 8.09E-06 | 1.04E-03 | IL1RN; FCGR2A; EPAS1; SAMSN1 |
| Bromocriptine HL60 UP | 6.94E-05 | 4.07E-03 | FCGR2A; TNFAIP6; SAMSN1 |

Computationally predicted results usually need experimental verification, which is difficult and limited in practice. We therefore followed Zhang et al. [47], who proposed a validation process for suggested drug compounds based on the receiver operating characteristic (ROC) curve, and validated our suggested compounds in the same way. Figure 11 compares the validation performance of the top five suggested drug compounds using the ROC curve; among them, nickel sulfate has a higher validation accuracy than the others. The other suggested drug compounds were validated using the same procedure, which should make the suggestions more useful to the medical community.

Figure 11: Performance comparison of the top five suggested drug compounds based on the ROC curve. Nickel sulfate has a higher validation accuracy than the others.
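For readers unfamiliar with ROC-based comparison, the sketch below shows how an area under the ROC curve (AUC) can be computed from binary labels and continuous scores. It uses the rank interpretation of AUC: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. How labels and scores were defined for each compound in this study is not stated here, so the inputs below are purely hypothetical, and the function `roc_auc` is an illustration rather than the validation code of Zhang et al. or of this work.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = positive).
    scores: iterable of numeric scores, higher = more likely positive.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(example_labels, example_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so comparing compounds by AUC amounts to asking which score separates the two classes best.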
Discussion

COVID-19 is more common in people who have lung disease. This study contributes a bioinformatics and machine learning model to identify the genetic effects of SARS-CoV-2 on IPF-affected patients. Shortness of breath, cough, and chest pain are the most typical symptoms of these two diseases. Using bioinformatics techniques, 1725 and 1008 DEGs were found in GSE147507 and GSE52463, respectively. The DEGs shared by the two datasets were then identified, giving a total of 90 common DEGs. From these 90, twenty were chosen for further study based on the P-value (MDK, HP, HSPB6, CHIT1, TNFAIP6, EPAS1, MMP1, CCL18, CXCL6, CCL11, IL1RN, LAMP3, CD207, ARRB1, RNASE2, LILRA1, FCGR2A, STAT4, CD69, and SAMSN1). The research then proceeded through GO analysis, pathway analysis (KEGG, Wiki Pathways, Reactome, and BioCarta), PPI analysis, TF–gene and TF–miRNA coregulatory network analysis, and candidate drug detection.

The common DEGs were used to find GO terms, which were identified using the combined score. GO analysis covers three categories: biological process, molecular function, and cellular component [48]. KEGG, Wiki Pathways, Reactome, and BioCarta were used to identify pathway analysis results for the common DEGs. KEGG is a database that aids researchers in understanding the high-level functions and utility of biological systems.

Because hub-gene recognition, module analysis, and drug identification all depend strongly on the PPI network, it is the most significant part of the research. The common DEGs were subjected to PPI analysis, and the hub-genes in the PPI network were identified. The five genes that have been highlighted are AKT1, IL1B, CCL5, MMP9, and ARRB1.
These five genes are classified as hub-genes based on their degree value. The aim of concentrating on this small set is to suggest more effective drug compounds.

The interaction of TF–genes and miRNAs was investigated to identify transcriptional and post-transcriptional regulators of the common DEGs. The common DEGs were used to investigate TF–gene interactions. TF–genes act as regulators of gene expression, which can contribute to cancer cell formation. In the TF–gene network, which contains 12 DEGs and 178 TF–genes, 85 TF–genes regulate HSPB6, 68 regulate EPAS1, and 37 regulate FCGR2A according to their degree value. The TF–miRNA coregulatory network depicts the interactions between the miRNAs and TF–genes that may influence the common DEGs; 87 miRNAs and 93 TF–genes were found. Several studies have found evidence of altered miRNA expression in IPF samples, and members of the miR-200 family play a significant role in IPF [49].

Taz et al. [50] investigated only 69 samples, whereas we analyzed 110 SARS-CoV-2 samples. As a result, this research should help integrate COVID-19 treatment with the treatment of IPF as a risk factor. Chemical testing can be used to verify the drugs' efficacy.

Our research has several application areas for the scientific community. First, researchers can use the same approach to investigate the impact of SARS-CoV-2 on other diseases. Second, if a new virus appears, our research can serve as a useful starting point for further investigation. Third, our research suggests several viable drugs, so with more research scientists may be able to find a treatment for SARS-CoV-2. Finally, our research is an example of a virus's genetic relationship with a certain type of patient, so researchers can use this methodology to work out the genetic relationships between different viruses and patients.
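The shared-gene step that underlies this discussion (finding DEGs common to two datasets and keeping the most significant ones) can be sketched in a few lines of Python. The rule used here to rank a shared gene, taking the larger of its two P-values, is an assumption made for illustration; the study states only that the selection was based on the P-value. The function name `common_degs`, the gene labels, and the P-values are all hypothetical.

```python
def common_degs(degs_a, degs_b, top_n=20):
    """Return up to top_n genes present in both DEG tables.

    degs_a, degs_b: dicts mapping gene symbol -> P-value in each dataset.
    Shared genes are ranked by the larger (more conservative) of their two
    P-values; this ranking rule is an assumption, not the paper's method.
    """
    shared = set(degs_a) & set(degs_b)
    ranked = sorted(shared, key=lambda g: (max(degs_a[g], degs_b[g]), g))
    return ranked[:top_n]


# Hypothetical DEG tables, for illustration only.
dataset_a = {"GENE1": 0.010, "GENE2": 0.001, "GENE3": 0.040}
dataset_b = {"GENE2": 0.020, "GENE3": 0.003, "GENE4": 0.0001}
print(common_degs(dataset_a, dataset_b, top_n=20))
```

Other combination rules, such as Fisher's method or ranking by one dataset only, would give a different ordering, so the choice should be stated explicitly when the approach is reused for other disease pairs.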
Conclusions

COVID-19 infection is associated with high risk in IPF patients. Shortness of breath, cough, and chest pain are the most typical symptoms of these two diseases. We used machine learning and bioinformatics analysis to summarize the relationships between the genes involved in the two diseases. We analyzed DEGs from two selected datasets, identified the genes they share, and characterized the infection responses of SARS-CoV-2- and IPF-affected lung cells. As a result, we found 90 genes that are shared across these datasets. These interconnected genes were used to build the PPI network, from which the five most important hub-genes were identified. In addition, we examined whether the SARS-CoV-2 and IPF analysis might help in identifying infections associated with other diseases. The therapeutic targets follow from the discovery of the hub-genes and could serve as an effective starting point until licensed medications become available. We believe that the biomarkers, pathways, and molecular markers we discovered will be valuable in developing pharmacological therapies.

Declarations

Ethical Approval
Not applicable (there is no human-related data, so ethical approval was not sought from an external committee).

Consent to Participate
Not applicable (there is no human-related data, so consent from participants was not necessary).

Consent to Publish
Not applicable (there is no human-related data, so consent to publish from participants was not necessary).

Funding
This work was supported by Researchers Supporting Project number (RSP-2021/100), King Saud University, Riyadh, Saudi Arabia. This work was supported in part by funding from the Natural Sciences and Engineering Research Council of Canada (NSERC).

Acknowledgement