Graphical abstract [27]graphic file with name ga1.jpg [28]Open in a new tab Abbreviations: CRC, Colorectal cancer; HCMV, Human Cytomegalovirus; HPV16 and 18, Human papillomavirus; JCV, John Cunningham virus; DEPs, Differentially expressed proteins; NH, Network heterogeneity; NC, Network centralization; SLiMs, Short linear motifs; ELM, Eukaryotic Linear Motif Resource; IDR, Intrinsically disordered region; PME, Particle-Mesh Ewald; RMSD, Root-Mean-Square Deviation; RMSF, Root-Mean-Square Fluctuation; hbonds, Hydrogen bonds; MM-PBSA, molecular mechanics – Poisson Boltzmann surface area; SASA, Solvent Accessible Surface Area Keywords: Colorectal cancer, Virus-host interactions, Small molecules, Docking, Molecular dynamics Highlights * • Systematic analysis of virus-host networks identified key pathways in CRC. * • Core virus-CRC network revealed the growth pathway regulated by viruses. * • Short linear motif analysis identified druggable regions in virus proteins. * • Virtual screening revealed key anti-viral molecules against viral proteins. * • Molecular dynamics simulations showed the effect of anti-viral molecules. Abstract Colorectal cancer (CRC) is a significant contributor to cancer-related deaths caused by an unhealthy lifestyle. Multiple studies reveal that viruses are involved in colorectal tumorigenesis. The viruses such as Human Cytomegalovirus (HCMV), Human papillomaviruses (HPV16 & HPV18), and John Cunningham virus (JCV) are known to cause colorectal cancer. The molecular mechanisms of cancer genesis and maintenance shared by these viruses remain unclear. We analysed the virus-host networks and connected them with colorectal cancer proteome datasets and extracted the core shared interactions in the virus-host CRC network. Our network topology analysis identified prominent virus proteins RL6 (HCMV), VE6 (HPV16 and HPV18), and Large T antigen (JCV). Sequence analysis uncovered short linear motifs (SLiMs) in each viral target. We used these targets to identify the antiviral drugs through a structure-based virtual screening approach. This analysis highlighted that temsavir, pimodivir, famotine, and bictegravir bind to each virus protein target, respectively. We also assessed the effect of drug binding using molecular dynamic simulations, which shed light on the modulatory effect of drug molecules on SLiM regions in viral targets. Hence, our systematic screening of virus-host networks revealed viral targets, which could be crucial for cancer therapy. 1. Introduction Colorectal cancer (CRC) was identified as the third most prevalent cancer worldwide, with continuous increase in instances. It is associated with approximately 9% of cancer deaths [29][1]. Although the familial history of CRC is demonstrated, many CRC cases are sporadic rather than familial [30][2]. Multiple aspects such as heredity, epigenetics, lifestyle, and pathogens are identified as potential risk factors. Recent studies have established the role of viruses in many cancers and it is shown that 9.9% of cancers are associated with viral infections [31][3]. Viruses such as Human Cytomegalovirus (HCMV), Human papillomaviruses (HPV16 & HPV18), and John Cunningham virus (JCV) showed higher prevalence in tumour-associated colorectal tissues [32][4]. In a study conducted by Cobbs et al., [33][5] it was shown that 82% of the colorectal polyps had HCMV tumour promoting viral proteins (IE1-72 and pp65) in comparison to with adjacent non-neoplastic colon biopsy samples. A meta-analysis which combined all the outcomes from the majority of the colorectal cancer studies showed statistically significant level of HPV infection present in CRC tumor tissue [34][6]. Laghi, L et al [35][7] demonstrated that 81–89% of CRC tissues contained JCV, and the virus was predominant in CRC sections compared to the adjacent normal tissues. HCMV is a member of the beta-herpes virus subfamily. HCMV can co-exist with its host and can recrudescent at any moment, resulting in substantial morbidity and even death, particularly in immunocompromised patients [36][8]. According to several reports, HCMV nucleic particles were identified in tissue samples from patients with colorectal cancer [37][9], [38][10]. Some of these viral proteins enhance cellular processes like proliferation, angiogenesis, and metastasis, while others suppress the localized immune response [39][11]. HPV is a family of DNA viruses well known for its role in the occurrence and progression of many cancer types. There are currently 200 HPV genotypes known, with at least 18 belonging to the “high-risk” group primarily responsible for cancer development. Damin et al. [40][12], showed that the expression of viral (HPV) oncoproteins could induce carcinogenesis. Recent reports have demonstrated the presence of HPV16 [41][13] and HPV18 [42][14] DNA or antigen in clinical colorectal samples indicating the possible risk of HPV in CRC. Similarly, JCV being an opportunistic pathogen, is distinctly present in the gastrointestinal tract and remains latent in various organs such as kidneys, and B-lymphocytes [43][15]. Studies by Mou et al. [44][16] revealed the persistent association between JCV and colorectal tumor samples while Ksiaa et al. [45][17] manifested its tumorigenic role. Large T antigen, encoded by JCV, plays a critical role in cancer development by intervening in cellular phenotypes. It is efficient in dysregulating the cell cycle process by interacting with essential proteins (p53 and pRb) and other signalling pathways [46][18]. The mechanism of such viral tumorigenesis can be attributed to Short Linear Motifs (SLiMs). It is observed that, viral pathogens have developed approached for hijacking host cell machinery via SLiMs [47][19]. The knowledge of all virus-encoded protein interactions as well as the differential expression of host proteins in infected cells is essential for gaining a deeper understanding of complex diseases. In recent years, meta-analyses of host-virus protein interactions gained prominent importance in providing novel insights for designing comprehensive antiviral treatment strategies. Virtual screening approaches enable us to screen for the antiviral molecules against viral targets and understanding their behaviour using molecular dynamics simulations and permits us to identify the potential therapeutics. In our study, we chose the four viruses, HCMV, HPV16, HPV18, and JCV. Although the evidence of these viruses in CRC is established, the core molecular interactions through which these viruses target host (human) cellular machinery remains exploratory. Our network analysis identified the core viral-host interaction network involved in CRC and its associated pathway. This study enabled us to screen antiviral molecules against the viral proteins allowing us to discover the potential drug candidates. 2. Materials and methods 2.1. Construction of virus host networks We retrieved the virus-host interactions related to HCMV from Nobre LV et al. [48][20]. HPV16 and 18 virus-host interactomes were extracted from STRING viruses v10.5 database (accessed on: 30/04/2021) ([49]https://viruses.string-db.org/) [50][21]. We used a medium confidence score of 0.4 to retrieve the maximum number of interactions. JCV interactions are extracted from the human-virus interaction database (accessed on: 30/04/2021) (HVIDB, [51]https://zzdlab.com/hvidb/) [52][22]. We rendered these network diagrams in Cytoscape v3.8.2 [53][23]. 2.2. Literature mining of colorectal cancer proteome Through literature mining, we culled the differentially expressed proteins (DEPs) in human CRC proteomes. We retrieved the CRC proteins from Maja et al. [54][24], which gave 900 DEPs. Next, we salvaged 107 and 2778 CRC proteins from Buttacavoli et al. [55][25] and Saleem et al. [56][26] respectively. Finally, we combined these lists of CRC proteins, made a unique list of 3092 proteins, and mapped the fold changes to virus-host networks to derive the virus-host CRC protein interaction networks. 2.3. Mining of core virus–host CRC network We computed the degree for each virus-host CRC interaction network and identified the top 3 viral proteins from each virus type, to derive the subnetworks. Then, we merged these networks and extracted a shared network; this was refined further by selecting the top one virus protein from each virus and identified the core virus-host CRC network. 2.4. Pathway enrichment analysis We performed the pathway enrichment analysis of proteins from virus-host interaction networks using METASCAPE server ([57]https://metascape.org/gp/index.html#/main/step1) (accessed on: 14/06/2021) [58][27], which contained the gene ontology biological process. The pathways were considered statistically significant with a p-value ≤0.05. 2.5. Short linear motif identification and disorder prediction We used the Eukaryotic Linear Motif (ELM) resource (accessed on: 15/06/2021) ([59]http://elm.eu.org/) [60][28] to identify the human proteins having short linear motifs that share a similar region with viral proteins. We used p-value ≤0.01 to filter out the statistically significant motifs. Disprot ([61]https://disprot.org/) (accessed on: 16/06/2021 and 21/06/2022) [62][29] was used to identify the disordered regions in viral proteins. Then, we identified the amino acid residues contributing to the overall disorderness of the protein (disorder score cut-off >0.5). 2.6. Retrieval of three-dimensional protein structures and druggability analysis We retrieved the 3D X-ray based structures of VE6_HPV16 (PDB id: 6SJA) and VE6_HPV18 (PDB id: 6SJV) from RCSB Protein Data Bank ([63]https://www.rcsb.org/) [64][30]. We employed transform restrained Rosetta (trRosetta) algorithm ([65]https://yanglab.nankai.edu.cn/trRosetta/) (accessed on: 17/06/2021) which uses deep learning and direct energy minimization to predict the protein structure [66][31]. While RL6_HCMV structure was modelled based on the de novo approach, large T antigen_JCV was modelled using deep learning approach combined with homology strategy. Its sequence was aligned to the crystal structure of SV40 large T antigen from Macaca mulatta polyomavirus 1 with sequence identity of 76.5 and raw alignment z-score of 30.044. Next, we computed the z-score of homology models to compare them with the experimentally derived structures (X-ray or NMR) using the Protein structure analysis (ProSA-web) tool ([67]https://prosa.services.came.sbg.ac.at/prosa.php) [68][32]. We used Molprobity server (accessed on: 18/06/2021) [69][33] to generate a Ramachandran plot and z-score [70][34] to predict the stereochemical quality of protein structural models. We mapped SLiM binding sites onto 3D protein structures, and computed druggability score using the PockDrug web server (accessed on: 19/06/2021 and 23/06/2022) [71][35]. 3D structures of HIV proteins (gp120 and integrase) were also retrieved from RCSB Protein Data Bank [72][30]. From the uniport database [73][36] annotations (accessed on: 21/06/22), we found that the surface protein gp120 is a part of envelope glycoprotein gp160 (PDB id: 3J70). So, we considered the sequence range corresponding to gp120 (33–511) for further analysis. Similarly, integrase is found to be the part of Gag-Pol Polyprotein (PDB id: 3NFA) that range between 1148 and 1435 amino acids. Therefore, the same region is chosen for SLiM region identification and docking studies. 2.7. Antiviral compounds We extracted the antiviral compounds from the CheMBL database (accessed on: 23/06/2021) ([74]https://www.ebi.ac.uk/chembl/) [75][37]. We filtered these antiviral compounds based on the Lipinski rule of 5 (RO5), a measure of drug likeliness [76][38]. This filtering process gave rise to 190 antiviral compounds. We mapped the CheMBL IDs of these antiviral compounds to PubChem IDs using the Pubchem Identifier Exchange Service. These PubChem IDs were uploaded to the PubChem database ([77]https://pubchem.ncbi.nlm.nih.gov) (accessed on: 23/06/2021) and downloaded the 3D conformers in structural data file (.SDF) format. They were converted to protein data bank format (.PDB) at physiological pH 7 using the OpenBabel v2.4 software [78][39] and used these formatted files for virtual screening. 2.8. Molecular docking We used the virtual screening software PyRx v0.9.9 [79][40] to perform the docking. We used the make macromolecule option to prepare protein structures by removing water, adding hydrogens and partial charges. Make ligand command is utilized to prepare antiviral compounds and converted them to PDBQT format. Then, we initiated the docking using AutoDockVina, which is an integral part of PyRx tool docking to calculate the binding affinities. Keeping the exhaustiveness at 8, we centred the search space and receptor grids around the binding site (SLiM region in the protein targets). This was followed by the construction of XYZ center coordinates and grid boxes for each protein target: RL6_HCMV (center coordinates: −0.48 Å, 0.23 Å, 1.50 Å, grid box: 28.03 Å, 28.36 Å, 23.26 Å), VE6_HPV16 VE6 (center coordinates: −33.13 Å, 61.79 Å, −28.46 Å, grid box: 27.92 Å, 25.22 Å, 25.87 Å), VE6_HPV18 (center coordinates: −30.07 Å, 103.80 Å, 22.41 Å, grid box: 22.41 Å, 18.68 Å, 48.95 Å) and large T antigen_JCV (center coordinates: 103.319 Å,16.620 Å, 26.142 Å, grid box: 19.888 Å, 21.509 Å, 25.421 Å). GP120_HIV (center coordinates: 265.19 Å, 159.01 Å, 201.73 Å, grid box: 30.18 Å, 30.17 Å, 33.94 Å) and Integrase_HIV (center coordinates: 265.19 Å, 159.01 Å, 201.73 Å, grid box: 30.18 Å, 30.17 Å, 33.94 Å). 2.9. Molecular dynamics simulation analysis and binding free energy calculations The molecular dynamics (MD) simulations were carried out using the Gromacs 2020 package [80][41]. To initiate MD, we retrieved the apo form of proteins as well as target-drug complexes from docked poses. Using GROMACS and the SwissParam web server (accessed on: 12/07/2021) [81][42], we adopted the CHARMM27 force fields for protein targets and drug molecules. We performed five simulations for the viral target-drug complex including the one with apo-form of the protein target (without the drug). The pdb2gmx command and the SwissParam server [82][42] were used to construct the topologies for protein and drug. Once, the drug molecule topologies are re-joined to the protein structures, they were inserted into a cubic box using the explicit TIP3 water model with a buffering distance of 1.2 nm. To neutralize the net charges in the system, we added Na and Cl ions to the system. Next, these complexes were subjected to a minimization procedure with the Steepest Descent method for 2000 steps. The systems equilibrated in two steps: NVT and NPT ensembles run with the Brendsen coupling algorithm in the periodic boundary conditions. The temperature was kept constant at 310 K with pressure at 1 bar. The electrostatic interactions with an interpolation order of four and grid spacing of 0.12 nm and constrained using the LINCS algorithm computed with Particle-Mesh Ewald (PME). All the simulations were set at 2 fs and ran for 100 ns. We analysed the MD trajectories with built-in GROMACS, gmx rms, gmx rmsf, gmx hbond commands. The Molecular Mechanics energies combined with Poisson-Boltzmann (MM-PBSA) approach was used to compute the binding free energies of protein-drug complexes at last 50 ns MD trajectories with the g_mmpbsa tool [83][43]. 2.10. Data visualization and statistical analysis The statistical analysis and visualization in the study was done using the R statistical software v4.0 ([84]https://www.r-project.org) unless stated otherwise. We used Cytoscape plug-in, Network analyser [85][44], to compute the network heterogeneity and centralization parameters. The bar plots, topology parameters such as degree, network centralization, heterogeneity, and binding affinities were visualized using the ggplot2 R package. We generated the random network using the Erdos Renyi G(n,p) model, part of the Network Randomizer Cytoscape plug-in. We plotted number of common and unique proteins as a barplot using the UpSet R package. MD trajectories were visualized as line plots using xmgrace tool. Statistical analysis of RMSF values rendered as Boxplots, Wilcoxon-test was calculated using the ggpubr R package, and p-value ≤ 0.05 was considered as significant. We visualized Protein structures and docked complexes in Pymol v2.7 software. The discovery studio visualizer ([86]https://discover.3ds.com/discovery-studio-visualizer-download) was used to visualize the molecular interactions between the drug and protein structure. 3. Results 3.1. In-silico workflow to understand the virus-host relationship Viruses such as HCMV, HPV16, HPV18, and JCV are known to be the risk factors for the occurrence and development of CRC. In this study, we aimed to understand the virus-host interactions in CRC, and for this purpose, we designed an in-silico workflow ([87]Fig. 1A). First, we retrieved the virus-host protein interactions and constructed the networks. Following the literature search, we extracted the CRC associated proteins and mapped to the virus-host interactomes to isolate the virus-host CRC networks. We analyzed the topological parameters of the network and performed pathway enrichment analysis to extract the core virus-host CRC network. Then, we identified the SLiMs the highest ranked protein for each of the four viruses and predicted the disorder regions. We extracted antiviral molecules from the ChemBL database and performed molecular docking with the viral protein targets identified from the core virus-host CRC network. Finally, we analyzed the molecular dynamic interactions of virus-protein drug complexes. Fig. 1. [88]Fig. 1 [89]Open in a new tab Virus – host network analysis in CRC proteome. (A) The schematic representation of integrative data analysis. (B–E) This represents the HCMV, HPV16, HPV18 and JCV interactions with the human host. 3.2. Retrieval of virus-host networks and impact of viruses on host Viruses are known to interact with host cellular machinery, rewire molecular pathways, and promote their replication [90][45]. Therefore, we sought to understand the HCMV, HPV16, HPV18, and JCV relationship with the host (human) concerning the development and advancement of CRC pathobiology. To do this, we retrieved the virus-host interactions of these viruses ([91]Supplementary Fig. 1A–D). We constructed the networks composed of virus-specific proteins and their interactions. As HCMV has the largest genome size, this virus thus has the highest number of proteins (n = 167) connecting with 2175 host proteins ([92]Fig. 1B). While the HPV16 has seven proteins interacting with 100 host proteins ([93]Fig. 1C), HPV18 and JCV viruses have five proteins each collaborating with 47 and 51 host proteins respectively ([94]Fig. 1D-E). Since biological networks are known to be densely connected and participate in various functions, we computed topological parameters; network heterogeneity and centralization of the virus-host networks to compare with random networks. Network heterogeneity (nh) measures the degree of network distribution and it shows that biological network tends to have central nodes [95][46]. The network centralization (nc) index measures the degree of dispersion of all node centrality scores in a network. This analysis revealed that virus-host networks have high scores of nh and nc affirming that biological networks are heterogeneous with hubs and are densely connected in comparison to random networks ([96]Supplementary Fig. 2A–D). Next, we performed pathway enrichment analysis for the host proteins corresponding to each virus network ([97]Supplementary Fig. 3A–D). Most of the proteins from the HCMV network are associated with protein metabolism and localization. Viruses are known to utilize the host metabolic machinery to replicate in the cellular environment [98][47]. On the other hand, the proteins from HPV16 are involved in the G1/S phase transition of the cell cycle and cellular stress response. Similarly, most of the HPV18 network proteins are engaged in cell cycle G1/S phase transition and positively regulate the cell cycle process. Proteins from the JCV network are enriched in negative regulation of G1/S transition mitotic phase. It is shown that viruses interact with cell cycle machinery and enhance their replication [99][48]. Overall, we found that host proteins from all their corresponding virus networks are enriched in pathways associated with cell cycle regulation. 3.3. Colorectal cancer proteins are enriched in virus-host network modules Once we had virus-host networks in hand, we aimed to deduce the virus's relationship with colorectal cancer. We extracted the CRC proteins and constructed a virus-host CRC network ([100]Fig. 2A). Specifically, we mined 3092 CRC proteins from various literature sources (see methods) and mapped them onto the virus-host network of each virus type (HCMV, HPV16, HPV18, and JCV). We found that the HCMV network consists of 728 CRC proteins while the HPV16, HPV18, and JCV networks were included with 26, 47, and 11 ([101]Fig. 2B); this enabled us to create individual virus-host specific CRC networks ([102]Supplementary Fig. 4A–C). Our pathway enrichment analysis revealed that HCMV-CRC network regulates exocytosis, apoptotic signaling, and proteolysis pathways. The HPV16-CRC network is enriched in the G1/S transition of mitotic cell cycle, apoptosis, and DNA repair pathways. HPV18-CRC network participates in myeloid leukocyte mediated immunity, Wnt signaling, and cell cycle pathways. JCV-CRC network is engaged with cell cycle and catabolic processes ([103]Supplementary Fig. 5A-D). Fig. 2. [104]Fig. 2 [105]Open in a new tab Virus-host CRC proteome networks. (A) The outline of the workflow to generate the virus-CRC networks. (B) The UpSet plot displays the exclusive (single black dots) and shared (connecting black dots) virus-host and CRC proteins. (C–F) HCMV, HPV16, HPV18 and JCV networks represent the top three virus proteins connected with CRC proteins in the host. Note: Red and blue colors represent the up and down regulation of CRC proteins. (For interpretation of the references to