Graphical abstract

   [27]graphic file with name ga1.jpg
   [28]Open in a new tab

   Abbreviations: CRC, Colorectal cancer; HCMV, Human Cytomegalovirus;
   HPV16 and 18, Human papillomavirus; JCV, John Cunningham virus; DEPs,
   Differentially expressed proteins; NH, Network heterogeneity; NC,
   Network centralization; SLiMs, Short linear motifs; ELM, Eukaryotic
   Linear Motif Resource; IDR, Intrinsically disordered region; PME,
   Particle-Mesh Ewald; RMSD, Root-Mean-Square Deviation; RMSF,
   Root-Mean-Square Fluctuation; hbonds, Hydrogen bonds; MM-PBSA,
   molecular mechanics – Poisson Boltzmann surface area; SASA, Solvent
   Accessible Surface Area

   Keywords: Colorectal cancer, Virus-host interactions, Small molecules,
   Docking, Molecular dynamics

Highlights

     * •
       Systematic analysis of virus-host networks identified key pathways
       in CRC.
     * •
       Core virus-CRC network revealed the growth pathway regulated by
       viruses.
     * •
       Short linear motif analysis identified druggable regions in virus
       proteins.
     * •
       Virtual screening revealed key anti-viral molecules against viral
       proteins.
     * •
       Molecular dynamics simulations showed the effect of anti-viral
       molecules.

Abstract

   Colorectal cancer (CRC) is a significant contributor to cancer-related
   deaths caused by an unhealthy lifestyle. Multiple studies reveal that
   viruses are involved in colorectal tumorigenesis. The viruses such as
   Human Cytomegalovirus (HCMV), Human papillomaviruses (HPV16 & HPV18),
   and John Cunningham virus (JCV) are known to cause colorectal cancer.
   The molecular mechanisms of cancer genesis and maintenance shared by
   these viruses remain unclear. We analysed the virus-host networks and
   connected them with colorectal cancer proteome datasets and extracted
   the core shared interactions in the virus-host CRC network. Our network
   topology analysis identified prominent virus proteins RL6 (HCMV), VE6
   (HPV16 and HPV18), and Large T antigen (JCV). Sequence analysis
   uncovered short linear motifs (SLiMs) in each viral target. We used
   these targets to identify the antiviral drugs through a structure-based
   virtual screening approach. This analysis highlighted that temsavir,
   pimodivir, famotine, and bictegravir bind to each virus protein target,
   respectively. We also assessed the effect of drug binding using
   molecular dynamic simulations, which shed light on the modulatory
   effect of drug molecules on SLiM regions in viral targets. Hence, our
   systematic screening of virus-host networks revealed viral targets,
   which could be crucial for cancer therapy.

1. Introduction

   Colorectal cancer (CRC) was identified as the third most prevalent
   cancer worldwide, with continuous increase in instances. It is
   associated with approximately 9% of cancer deaths [29][1]. Although the
   familial history of CRC is demonstrated, many CRC cases are sporadic
   rather than familial [30][2]. Multiple aspects such as heredity,
   epigenetics, lifestyle, and pathogens are identified as potential risk
   factors. Recent studies have established the role of viruses in many
   cancers and it is shown that 9.9% of cancers are associated with viral
   infections [31][3]. Viruses such as Human Cytomegalovirus (HCMV), Human
   papillomaviruses (HPV16 & HPV18), and John Cunningham virus (JCV)
   showed higher prevalence in tumour-associated colorectal tissues
   [32][4]. In a study conducted by Cobbs et al., [33][5] it was shown
   that 82% of the colorectal polyps had HCMV tumour promoting viral
   proteins (IE1-72 and pp65) in comparison to with adjacent
   non-neoplastic colon biopsy samples. A meta-analysis which combined all
   the outcomes from the majority of the colorectal cancer studies showed
   statistically significant level of HPV infection present in CRC tumor
   tissue [34][6]. Laghi, L et al [35][7] demonstrated that 81–89% of CRC
   tissues contained JCV, and the virus was predominant in CRC sections
   compared to the adjacent normal tissues.

   HCMV is a member of the beta-herpes virus subfamily. HCMV can co-exist
   with its host and can recrudescent at any moment, resulting in
   substantial morbidity and even death, particularly in immunocompromised
   patients [36][8]. According to several reports, HCMV nucleic particles
   were identified in tissue samples from patients with colorectal cancer
   [37][9], [38][10]. Some of these viral proteins enhance cellular
   processes like proliferation, angiogenesis, and metastasis, while
   others suppress the localized immune response [39][11]. HPV is a family
   of DNA viruses well known for its role in the occurrence and
   progression of many cancer types. There are currently 200 HPV genotypes
   known, with at least 18 belonging to the “high-risk” group primarily
   responsible for cancer development. Damin et al. [40][12], showed that
   the expression of viral (HPV) oncoproteins could induce carcinogenesis.
   Recent reports have demonstrated the presence of HPV16 [41][13] and
   HPV18 [42][14] DNA or antigen in clinical colorectal samples indicating
   the possible risk of HPV in CRC. Similarly, JCV being an opportunistic
   pathogen, is distinctly present in the gastrointestinal tract and
   remains latent in various organs such as kidneys, and B-lymphocytes
   [43][15]. Studies by Mou et al. [44][16] revealed the persistent
   association between JCV and colorectal tumor samples while Ksiaa et al.
   [45][17] manifested its tumorigenic role. Large T antigen, encoded by
   JCV, plays a critical role in cancer development by intervening in
   cellular phenotypes. It is efficient in dysregulating the cell cycle
   process by interacting with essential proteins (p53 and pRb) and other
   signalling pathways [46][18]. The mechanism of such viral tumorigenesis
   can be attributed to Short Linear Motifs (SLiMs). It is observed that,
   viral pathogens have developed approached for hijacking host cell
   machinery via SLiMs [47][19].

   The knowledge of all virus-encoded protein interactions as well as the
   differential expression of host proteins in infected cells is essential
   for gaining a deeper understanding of complex diseases. In recent
   years, meta-analyses of host-virus protein interactions gained
   prominent importance in providing novel insights for designing
   comprehensive antiviral treatment strategies. Virtual screening
   approaches enable us to screen for the antiviral molecules against
   viral targets and understanding their behaviour using molecular
   dynamics simulations and permits us to identify the potential
   therapeutics.

   In our study, we chose the four viruses, HCMV, HPV16, HPV18, and JCV.
   Although the evidence of these viruses in CRC is established, the core
   molecular interactions through which these viruses target host (human)
   cellular machinery remains exploratory. Our network analysis identified
   the core viral-host interaction network involved in CRC and its
   associated pathway. This study enabled us to screen antiviral molecules
   against the viral proteins allowing us to discover the potential drug
   candidates.

2. Materials and methods

2.1. Construction of virus host networks

   We retrieved the virus-host interactions related to HCMV from Nobre LV
   et al. [48][20]. HPV16 and 18 virus-host interactomes were extracted
   from STRING viruses v10.5 database (accessed on: 30/04/2021)
   ([49]https://viruses.string-db.org/) [50][21]. We used a medium
   confidence score of 0.4 to retrieve the maximum number of interactions.
   JCV interactions are extracted from the human-virus interaction
   database (accessed on: 30/04/2021) (HVIDB,
   [51]https://zzdlab.com/hvidb/) [52][22]. We rendered these network
   diagrams in Cytoscape v3.8.2 [53][23].

2.2. Literature mining of colorectal cancer proteome

   Through literature mining, we culled the differentially expressed
   proteins (DEPs) in human CRC proteomes. We retrieved the CRC proteins
   from Maja et al. [54][24], which gave 900 DEPs. Next, we salvaged 107
   and 2778 CRC proteins from Buttacavoli et al. [55][25] and Saleem et
   al. [56][26] respectively. Finally, we combined these lists of CRC
   proteins, made a unique list of 3092 proteins, and mapped the fold
   changes to virus-host networks to derive the virus-host CRC protein
   interaction networks.

2.3. Mining of core virus–host CRC network

   We computed the degree for each virus-host CRC interaction network and
   identified the top 3 viral proteins from each virus type, to derive the
   subnetworks. Then, we merged these networks and extracted a shared
   network; this was refined further by selecting the top one virus
   protein from each virus and identified the core virus-host CRC network.

2.4. Pathway enrichment analysis

   We performed the pathway enrichment analysis of proteins from
   virus-host interaction networks using METASCAPE server
   ([57]https://metascape.org/gp/index.html#/main/step1) (accessed on:
   14/06/2021) [58][27], which contained the gene ontology biological
   process. The pathways were considered statistically significant with a
   p-value ≤0.05.

2.5. Short linear motif identification and disorder prediction

   We used the Eukaryotic Linear Motif (ELM) resource (accessed on:
   15/06/2021) ([59]http://elm.eu.org/) [60][28] to identify the human
   proteins having short linear motifs that share a similar region with
   viral proteins. We used p-value ≤0.01 to filter out the statistically
   significant motifs. Disprot ([61]https://disprot.org/) (accessed on:
   16/06/2021 and 21/06/2022) [62][29] was used to identify the disordered
   regions in viral proteins. Then, we identified the amino acid residues
   contributing to the overall disorderness of the protein (disorder score
   cut-off >0.5).

2.6. Retrieval of three-dimensional protein structures and druggability
analysis

   We retrieved the 3D X-ray based structures of VE6_HPV16 (PDB id: 6SJA)
   and VE6_HPV18 (PDB id: 6SJV) from RCSB Protein Data Bank
   ([63]https://www.rcsb.org/) [64][30]. We employed transform restrained
   Rosetta (trRosetta) algorithm
   ([65]https://yanglab.nankai.edu.cn/trRosetta/) (accessed on:
   17/06/2021) which uses deep learning and direct energy minimization to
   predict the protein structure [66][31]. While RL6_HCMV structure was
   modelled based on the de novo approach, large T antigen_JCV was
   modelled using deep learning approach combined with homology strategy.
   Its sequence was aligned to the crystal structure of SV40 large T
   antigen from Macaca mulatta polyomavirus 1 with sequence identity of
   76.5 and raw alignment z-score of 30.044. Next, we computed the z-score
   of homology models to compare them with the experimentally derived
   structures (X-ray or NMR) using the Protein structure analysis
   (ProSA-web) tool ([67]https://prosa.services.came.sbg.ac.at/prosa.php)
   [68][32]. We used Molprobity server (accessed on: 18/06/2021) [69][33]
   to generate a Ramachandran plot and z-score [70][34] to predict the
   stereochemical quality of protein structural models. We mapped SLiM
   binding sites onto 3D protein structures, and computed druggability
   score using the PockDrug web server (accessed on: 19/06/2021 and
   23/06/2022) [71][35]. 3D structures of HIV proteins (gp120 and
   integrase) were also retrieved from RCSB Protein Data Bank [72][30].
   From the uniport database [73][36] annotations (accessed on: 21/06/22),
   we found that the surface protein gp120 is a part of envelope
   glycoprotein gp160 (PDB id: 3J70). So, we considered the sequence range
   corresponding to gp120 (33–511) for further analysis. Similarly,
   integrase is found to be the part of Gag-Pol Polyprotein (PDB id: 3NFA)
   that range between 1148 and 1435 amino acids. Therefore, the same
   region is chosen for SLiM region identification and docking studies.

2.7. Antiviral compounds

   We extracted the antiviral compounds from the CheMBL database (accessed
   on: 23/06/2021) ([74]https://www.ebi.ac.uk/chembl/) [75][37]. We
   filtered these antiviral compounds based on the Lipinski rule of 5
   (RO5), a measure of drug likeliness [76][38]. This filtering process
   gave rise to 190 antiviral compounds. We mapped the CheMBL IDs of these
   antiviral compounds to PubChem IDs using the Pubchem Identifier
   Exchange Service. These PubChem IDs were uploaded to the PubChem
   database ([77]https://pubchem.ncbi.nlm.nih.gov) (accessed on:
   23/06/2021) and downloaded the 3D conformers in structural data file
   (.SDF) format. They were converted to protein data bank format (.PDB)
   at physiological pH 7 using the OpenBabel v2.4 software [78][39] and
   used these formatted files for virtual screening.

2.8. Molecular docking

   We used the virtual screening software PyRx v0.9.9 [79][40] to perform
   the docking. We used the make macromolecule option to prepare protein
   structures by removing water, adding hydrogens and partial charges.
   Make ligand command is utilized to prepare antiviral compounds and
   converted them to PDBQT format. Then, we initiated the docking using
   AutoDockVina, which is an integral part of PyRx tool docking to
   calculate the binding affinities. Keeping the exhaustiveness at 8, we
   centred the search space and receptor grids around the binding site
   (SLiM region in the protein targets). This was followed by the
   construction of XYZ center coordinates and grid boxes for each protein
   target: RL6_HCMV (center coordinates: −0.48 Å, 0.23 Å, 1.50 Å, grid
   box: 28.03 Å, 28.36 Å, 23.26 Å), VE6_HPV16 VE6 (center coordinates:
   −33.13 Å, 61.79 Å, −28.46 Å, grid box: 27.92 Å, 25.22 Å, 25.87 Å),
   VE6_HPV18 (center coordinates: −30.07 Å, 103.80 Å, 22.41 Å, grid box:
   22.41 Å, 18.68 Å, 48.95 Å) and large T antigen_JCV (center coordinates:
   103.319 Å,16.620 Å, 26.142 Å, grid box: 19.888 Å, 21.509 Å, 25.421 Å).
   GP120_HIV (center coordinates: 265.19 Å, 159.01 Å, 201.73 Å, grid box:
   30.18 Å, 30.17 Å, 33.94 Å) and Integrase_HIV (center coordinates:
   265.19 Å, 159.01 Å, 201.73 Å, grid box: 30.18 Å, 30.17 Å, 33.94 Å).

2.9. Molecular dynamics simulation analysis and binding free energy
calculations

   The molecular dynamics (MD) simulations were carried out using the
   Gromacs 2020 package [80][41]. To initiate MD, we retrieved the apo
   form of proteins as well as target-drug complexes from docked poses.
   Using GROMACS and the SwissParam web server (accessed on: 12/07/2021)
   [81][42], we adopted the CHARMM27 force fields for protein targets and
   drug molecules. We performed five simulations for the viral target-drug
   complex including the one with apo-form of the protein target (without
   the drug). The pdb2gmx command and the SwissParam server [82][42] were
   used to construct the topologies for protein and drug. Once, the drug
   molecule topologies are re-joined to the protein structures, they were
   inserted into a cubic box using the explicit TIP3 water model with a
   buffering distance of 1.2 nm. To neutralize the net charges in the
   system, we added Na and Cl ions to the system. Next, these complexes
   were subjected to a minimization procedure with the Steepest Descent
   method for 2000 steps. The systems equilibrated in two steps: NVT and
   NPT ensembles run with the Brendsen coupling algorithm in the periodic
   boundary conditions. The temperature was kept constant at 310 K with
   pressure at 1 bar. The electrostatic interactions with an interpolation
   order of four and grid spacing of 0.12 nm and constrained using the
   LINCS algorithm computed with Particle-Mesh Ewald (PME). All the
   simulations were set at 2 fs and ran for 100 ns. We analysed the MD
   trajectories with built-in GROMACS, gmx rms, gmx rmsf, gmx hbond
   commands. The Molecular Mechanics energies combined with
   Poisson-Boltzmann (MM-PBSA) approach was used to compute the binding
   free energies of protein-drug complexes at last 50 ns MD trajectories
   with the g_mmpbsa tool [83][43].

2.10. Data visualization and statistical analysis

   The statistical analysis and visualization in the study was done using
   the R statistical software v4.0 ([84]https://www.r-project.org) unless
   stated otherwise. We used Cytoscape plug-in, Network analyser [85][44],
   to compute the network heterogeneity and centralization parameters. The
   bar plots, topology parameters such as degree, network centralization,
   heterogeneity, and binding affinities were visualized using the ggplot2
   R package. We generated the random network using the Erdos Renyi G(n,p)
   model, part of the Network Randomizer Cytoscape plug-in. We plotted
   number of common and unique proteins as a barplot using the UpSet R
   package. MD trajectories were visualized as line plots using xmgrace
   tool. Statistical analysis of RMSF values rendered as Boxplots,
   Wilcoxon-test was calculated using the ggpubr R package, and
   p-value ≤ 0.05 was considered as significant. We visualized Protein
   structures and docked complexes in Pymol v2.7 software. The discovery
   studio visualizer
   ([86]https://discover.3ds.com/discovery-studio-visualizer-download) was
   used to visualize the molecular interactions between the drug and
   protein structure.

3. Results

3.1. In-silico workflow to understand the virus-host relationship

   Viruses such as HCMV, HPV16, HPV18, and JCV are known to be the risk
   factors for the occurrence and development of CRC. In this study, we
   aimed to understand the virus-host interactions in CRC, and for this
   purpose, we designed an in-silico workflow ([87]Fig. 1A). First, we
   retrieved the virus-host protein interactions and constructed the
   networks. Following the literature search, we extracted the CRC
   associated proteins and mapped to the virus-host interactomes to
   isolate the virus-host CRC networks. We analyzed the topological
   parameters of the network and performed pathway enrichment analysis to
   extract the core virus-host CRC network. Then, we identified the SLiMs
   the highest ranked protein for each of the four viruses and predicted
   the disorder regions. We extracted antiviral molecules from the ChemBL
   database and performed molecular docking with the viral protein targets
   identified from the core virus-host CRC network. Finally, we analyzed
   the molecular dynamic interactions of virus-protein drug complexes.

Fig. 1.

   [88]Fig. 1
   [89]Open in a new tab

   Virus – host network analysis in CRC proteome. (A) The schematic
   representation of integrative data analysis. (B–E) This represents the
   HCMV, HPV16, HPV18 and JCV interactions with the human host.

3.2. Retrieval of virus-host networks and impact of viruses on host

   Viruses are known to interact with host cellular machinery, rewire
   molecular pathways, and promote their replication [90][45]. Therefore,
   we sought to understand the HCMV, HPV16, HPV18, and JCV relationship
   with the host (human) concerning the development and advancement of CRC
   pathobiology. To do this, we retrieved the virus-host interactions of
   these viruses ([91]Supplementary Fig. 1A–D). We constructed the
   networks composed of virus-specific proteins and their interactions. As
   HCMV has the largest genome size, this virus thus has the highest
   number of proteins (n = 167) connecting with 2175 host proteins
   ([92]Fig. 1B). While the HPV16 has seven proteins interacting with 100
   host proteins ([93]Fig. 1C), HPV18 and JCV viruses have five proteins
   each collaborating with 47 and 51 host proteins respectively ([94]Fig.
   1D-E). Since biological networks are known to be densely connected and
   participate in various functions, we computed topological parameters;
   network heterogeneity and centralization of the virus-host networks to
   compare with random networks. Network heterogeneity (nh) measures the
   degree of network distribution and it shows that biological network
   tends to have central nodes [95][46]. The network centralization (nc)
   index measures the degree of dispersion of all node centrality scores
   in a network. This analysis revealed that virus-host networks have high
   scores of nh and nc affirming that biological networks are
   heterogeneous with hubs and are densely connected in comparison to
   random networks ([96]Supplementary Fig. 2A–D).

   Next, we performed pathway enrichment analysis for the host proteins
   corresponding to each virus network ([97]Supplementary Fig. 3A–D). Most
   of the proteins from the HCMV network are associated with protein
   metabolism and localization. Viruses are known to utilize the host
   metabolic machinery to replicate in the cellular environment [98][47].
   On the other hand, the proteins from HPV16 are involved in the G1/S
   phase transition of the cell cycle and cellular stress response.
   Similarly, most of the HPV18 network proteins are engaged in cell cycle
   G1/S phase transition and positively regulate the cell cycle process.
   Proteins from the JCV network are enriched in negative regulation of
   G1/S transition mitotic phase. It is shown that viruses interact with
   cell cycle machinery and enhance their replication [99][48]. Overall,
   we found that host proteins from all their corresponding virus networks
   are enriched in pathways associated with cell cycle regulation.

3.3. Colorectal cancer proteins are enriched in virus-host network modules

   Once we had virus-host networks in hand, we aimed to deduce the virus's
   relationship with colorectal cancer. We extracted the CRC proteins and
   constructed a virus-host CRC network ([100]Fig. 2A). Specifically, we
   mined 3092 CRC proteins from various literature sources (see methods)
   and mapped them onto the virus-host network of each virus type (HCMV,
   HPV16, HPV18, and JCV). We found that the HCMV network consists of 728
   CRC proteins while the HPV16, HPV18, and JCV networks were included
   with 26, 47, and 11 ([101]Fig. 2B); this enabled us to create
   individual virus-host specific CRC networks ([102]Supplementary Fig.
   4A–C). Our pathway enrichment analysis revealed that HCMV-CRC network
   regulates exocytosis, apoptotic signaling, and proteolysis pathways.
   The HPV16-CRC network is enriched in the G1/S transition of mitotic
   cell cycle, apoptosis, and DNA repair pathways. HPV18-CRC network
   participates in myeloid leukocyte mediated immunity, Wnt signaling, and
   cell cycle pathways. JCV-CRC network is engaged with cell cycle and
   catabolic processes ([103]Supplementary Fig. 5A-D).

Fig. 2.

   [104]Fig. 2
   [105]Open in a new tab

   Virus-host CRC proteome networks. (A) The outline of the workflow to
   generate the virus-CRC networks. (B) The UpSet plot displays the
   exclusive (single black dots) and shared (connecting black dots)
   virus-host and CRC proteins. (C–F) HCMV, HPV16, HPV18 and JCV networks
   represent the top three virus proteins connected with CRC proteins in
   the host. Note: Red and blue colors represent the up and down
   regulation of CRC proteins. (For interpretation of the references to