Abstract
Background
Detection of appropriate receptor proteins and drug agents are equally
important in the case of drug discovery and development for any
disease. In this study, an attempt was made to explore colorectal
cancer (CRC) causing molecular signatures as receptors and drug agents
as inhibitors by using integrated statistics and bioinformatics
approaches.
Methods
To identify the important genes that are involved in the initiation and
progression of CRC, four microarray datasets ([35]GSE9348,
[36]GSE110224, [37]GSE23878, and [38]GSE35279) and an RNA_Seq profiles
([39]GSE50760) were downloaded from the Gene Expression Omnibus
database. The datasets were analyzed by a statistical r-package of
LIMMA to identify common differentially expressed genes (cDEGs). The
key genes (KGs) of cDEGs were detected by using the five topological
measures in the protein–protein interaction network analysis. Then we
performed in-silico validation for CRC-causing KGs by using different
web-tools and independent databases. We also disclosed the
transcriptional and post-transcriptional regulatory factors of KGs by
interaction network analysis of KGs with transcription factors (TFs)
and micro-RNAs. Finally, we suggested our proposed KGs-guided
computationally more effective candidate drug molecules compared to
other published drugs by cross-validation with the state-of-the-art
alternatives of top-ranked independent receptor proteins.
Results
We identified 50 common differentially expressed genes (cDEGs) from
five gene expression profile datasets, where 31 cDEGs were
downregulated, and the rest 19 were up-regulated. Then we identified 11
cDEGs (CXCL8, CEMIP, MMP7, CA4, ADH1C, GUCA2A, GUCA2B, ZG16, CLCA4,
MS4A12 and CLDN1) as the KGs. Different pertinent bioinformatic
analyses (box plot, survival probability curves, DNA methylation,
correlation with immune infiltration levels, diseases-KGs interaction,
GO and KEGG pathways) based on independent databases directly or
indirectly showed that these KGs are significantly associated with CRC
progression. We also detected four TFs proteins (FOXC1, YY1, GATA2 and
NFKB) and eight microRNAs (hsa-mir-16-5p, hsa-mir-195-5p,
hsa-mir-203a-3p, hsa-mir-34a-5p, hsa-mir-107, hsa-mir-27a-3p,
hsa-mir-429, and hsa-mir-335-5p) as the key transcriptional and
post-transcriptional regulators of KGs. Finally, our proposed 15
molecular signatures including 11 KGs and 4 key TFs-proteins guided 9
small molecules (Cyclosporin A, Manzamine A, Cardidigin, Staurosporine,
Benzo[A]Pyrene, Sitosterol, Nocardiopsis Sp, Troglitazone, and
Riccardin D) were recommended as the top-ranked candidate therapeutic
agents for the treatment against CRC.
Conclusion
The findings of this study recommended that our proposed target
proteins and agents might be considered as the potential diagnostic,
prognostic and therapeutic signatures for CRC.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12920-023-01488-w.
Keywords: Colorectal cancer (CRC), Gene expression profiles, Receptor
proteins, Drug agents, Integrated bioinformatics analyses
Introduction
Colorectal cancer (CRC) is an uncontrolled cell growth in the colon,
rectum or appendix. It is the second most commonly diagnosed cancer in
females and the third in males. The world health organization (WHO)
reported in 2018 that over 1.8 million new cases and nearly 862,000
deaths due to CRC worldwide [[40]1, [41]2]. With more than 2.2 million
new cases and 1.1 million fatalities, the global incidence of CRC is
projected to be increased 60% by 2030 [[42]3]. The early stages of CRC
symptoms are uncharacteristic and frequently ignored or misdiagnosed.
Importantly, CRC is diagnosed at the middle or late stages of the
disease. It is characteristically identified at the middle or late
stages of the disease. The fecal-based examination, enteroscopy and
blood-based examination are commonly considered the early detection
methods for CRC [[43]4]. However, several instrument-dependent
detection methods are time-consuming, laborious and expensive. The
leading treatment options for CRC are surgery, adjuvant chemotherapy
(for colon cancer), neoadjuvant radiotherapy (for rectal cancer), and
molecular drugs [[44]5, [45]6]. However, these types of treatments have
several drawbacks. According to the previous studies, less than 15% of
metastatic CRC is suitable for surgery, the spreading rate of CRC
exceeds more than 80% within 3 years after surgery, and the spreading
rate exceeds more than 95% within 5 years after surgery [[46]7].
Although there are some advancement in the case of CRC treatments, the
5-year survival time of patients with this disease has not yet
increased significantly [[47]6]. Therefore, the identification of new
molecular biomarkers is essential for CRC diagnosis, prognosis and new
therapies.
However, new drug discovery is a tremendously challenging,
time-consuming and expensive task. The main challenges are to explore
drug target proteins (receptors) responsible for the diseases and drug
agents (small molecules) that can reduce the diseases by the
interaction with the target proteins. Genomic biomarkers induced
proteins are considered as the key receptors. Transcriptomics analysis
is a widely used popular approach to explore genomic biomarkers
[[48]8–[49]13]. The repurposing of existing drugs for certain diseases
could reduce the time and cost compared to de novo drug development. By
this time, several authors have suggested different sets of key genes
(KGs) to explore molecular mechanisms and pathogenetic processes of CRC
progression [[50]14–[51]45] in which some studies have employed
multiple datasets to identify CRC-causing KGs [[52]15–[53]17, [54]22,
[55]25, [56]26, [57]31, [58]32, [59]37, [60]40–[61]43, [62]46–[63]49].
Few studies also explored their suggested KGs-guided candidate drug
molecules for the treatment against CRC [[64]14, [65]37, [66]40–[67]42,
[68]50–[69]59]. However, their published data did not display any
common KGs as well as common drug molecules (see Additional file [70]1:
Table S1) in all studies. None of those studies investigated the
resistance performance of their suggested KGs-guided drug molecules
against the CRC-causing independent KGs suggested by others. We found
CRC-causing 170 different KGs and associated 64 different drug
molecules in those articles. The articles those suggested therapeutics
agents applied enrichment approach on Cmap, geneXpharma or DGIdb
databases to select the KGs-guided candidate agents for the treatment
against CRC [[71]14, [72]37, [73]40–[74]42, [75]50–[76]59]. They did
not provide pairwise drug-target binding affinity scores, since
enrichment techniques cannot calculate pairwise binding scores. So, it
may be difficult to select most potential drug-target pairs from the
existing studies for experimental validation by the wet-lab
researchers. On the other hand, though the total number of KGs 170 is
much smaller than the whole genome size, it may be yet much laborious,
time consuming and costly for the experimental validation of more than
170 × 64 = 10,880 drug-target pairs by the wet-lab researchers.
Therefore, in this study, our main objectives were to explore (1) more
probable CRC-causing KGs from multiple gene expression profile datasets
through the verification with different benchmark datasets and
independent databases and (2) proposed KGs-guided candidate drug
molecules for the treatment of CRC through the verification of their
resistance power against the CRC-causing top-ranked independent KGs
suggested by others, by molecular docking analysis.
Materials and methods
The overview of this study including materials and methods is
summarized in Fig. [77]1.
Fig. 1 .
[78]Fig. 1
[79]Open in a new tab
The pipeline of this study
Data sources and descriptions
We collected gene expression profiles generated from CRC patients for
exploring drug targets and small molecules (drug agents) for exploring
candidate drugs by molecular docking simulation as described below.
Collection of gene expression profiles for exploring drug-target proteins
(receptors)
Four human CRC microarray datasets ([80]GSE9348, [81]GSE110224,
[82]GSE23878, and [83]GSE35279) and one RNA-Seq dataset ([84]GSE50760)
were downloaded from National Center for Biotechnology Information
(NCBI) Gene Expression Omnibus (GEO) database
([85]https://www.ncbi.nlm.nih.gov/geo/). The platform of [86]GSE9348,
[87]GSE110224, and [88]GSE23878, were [89]GPL570 [HG-U133_Plus_2]
(Affymetrix Human Genome U133 Plus 2.0 Array), [90]GSE35279 was
performed by [91]GPL6480 (Agilent-014850 Whole Human Genome Microarray
4 × 44 K G4112F) and [92]GSE50760 was performed by [93]GPL11154
Illumina HiSeq 2000 (Homo sapiens). The summary of this dataset is
given in Table [94]1.
Table 1.
Details of gene expression profiles that we analyzed
GEO accession Platform Year Country Normal (n) Tumor (n)
[95]GSE9348 [96]GPL570 2010 Singapore 12 70
[97]GSE35279 [98]GPL6480 2013 Japan 5 74
[99]GSE23878 [100]GPL570 2010 Saudi Arabia 24 35
[101]GSE110224 [102]GPL570 2018 Greece 17 17
[103]GSE50760 [104]GPL11154 2014 South Korea 18 18
[105]Open in a new tab
Collection of meta-drug agents for exploring candidate drugs
We collected meta-drug agents from the online database DSigDB [[106]60]
with respect to the proposed receptors and FDA approved repurposed
drugs for the treatment of CRC patients.
Collection of independent meta-receptors for cross-validation with the
proposed drugs
To select the top-ranked receptor proteins (meta-receptors) associated
with CRC, we reviewed 33 recently published articles and selected the
top-ranked 8 target proteins as the meta-receptors (see Additional file
[107]1: Table S1).
Integrated statistics and bioinformatics approaches
To reach the goal of this study, we applied both statistical and
bioinformatics approaches, as discussed below in detail.
Identification of DEGs by using LIMMA
To identify differentially expressed (DEGs) between tumor and normal
conditions, we considered the linear models for microarray (LIMMA) data
analysis suggested by Smith [[108]61], which can be written as
[MATH:
zg=Yθ<
/mi>g+ug
:MATH]
1
where
[MATH: zg=zg1
,zg2,
…,zgn/ :MATH]
is the vector of expressions (responses) for gth gene with
n = n[1] + n[2] samples (g = 1, 2, …, m), Y is an n × 2 design matrix,
[MATH: θg=θg1
,θg2/ :MATH]
is 2 × 1 vector (2 < n) of effects for two different groups of n
samples, and the error vector
[MATH:
ug∼N(0,Wg
σg2 :MATH]
). Here
[MATH: Wg :MATH]
is a positive definite weight matrix. We want to test the null
hypothesis (H[0]):
[MATH:
θg1=θg2=
>γg=(θg1-θg2)=0 :MATH]
(that is, gth gene is equally expressed gene (EEG) in both case and
control groups) against the alternative hypothesis (H[1]):
[MATH:
θg1≠θg2=
>γg≠0
:MATH]
(that is, the gth gene is a DEG between case and control groups). To
test H[0] against H[1], the moderated t-statistic was formulated by
hybridizing the classical and Bayesian approaches in which the
posterior variance is substituted into the classical t-statistic in
place of the classical sample variance. The moderated t-statistic was
defined as
[MATH: t~g=
z^g-z
mi>gs~gδ
g, :MATH]
2
which follows t-distribution with
[MATH:
dg+d0<
/mn> :MATH]
degrees of freedom under H[0].
Adjusted P values based on the moderated t-statistics and the average
of log fold-change (aLog2FC) values of the treatment group with respect
to the control group were used to select DEGs as follows
[MATH:
DEGg=DEGUpregulated,ifadj.p
.value0.01andaLogFC1.0<
mtd
columnalign="left">DEG
mi>Downregulated,
ifadj.p
.value<<
mn>0.01andaLogFC<-1.0 :MATH]
3
where
[MATH:
aLog2FC=1n1∑in
1log2(zgi
mi>T)-1n2∑jn
2log2(zgj
mi>C),i
mi>fn1≠<
mi>n21n
∑inlog2zgiTzg
jC,ifn1=<
mi>n2=n :MATH]
Here
[MATH:
zgiT<
/mi> :MATH]
and
[MATH:
zgjC<
/mi> :MATH]
are the expressions for the gth gene with the ith treatment and jth
control samples, respectively. We implemented this algorithm using
LIMMA r-package to calculate the P values [[109]62] and aLogFC values
to select the DEGs significantly from four gene expression datasets as
introduced previously. We separated upregulated and downregulated DEGs
for each of four datasets. Then we selected common upregulated and
downregulated DEGs for all of four datasets. Then we combine common
upregulated DEGs and common downregulated DEGs to construct the common
DEGs (cDEGs) set.
Construction of PPI network to identify CRC-causing key genes (KGs)
Protein–protein interaction (PPI) network was constructed to identify
common key-genes (KGs). The online STRING-v11 [[110]63] database was
used to construct the PPI network of cDEGs. The String database
provides critical assessment and integration of protein interactions,
including direct (physical) and indirect (functional) associations. To
construct a PPI network, the distance ‘D’ between pair of proteins
(u,v) is calculated as
[MATH: Du,v=2Nu∩N
mi>v|Nu+|Nv<
/mfenced> :MATH]
4
where N[u] is the neighbor set of u and N[v] is the neighbor set of v.
Cytoscape plugin cytoHubba was used to rank the nodes of the PPI
network, which could be utilized to identify KGs in the network
[[111]64, [112]65]. In the present study, five topological methods,
including Degree [[113]66], BottleNeck [[114]67], Betweenness
[[115]68], and Stress [[116]69] was utilized to identify KGs.
In-silico validation of CRC-causing KGs
An attempt was made to validate the CRC-causing KGs by using different
web-tools and independent databases as introduced below.
Expression analysis for KGs by GEPIA web-tool with TCGA RNA-seq data
To validate the expression levels of key genes, a gene expression
profiling interactive analysis (GEPIA) tool
([117]http://gepia.cancer-pku.cn/) was used to explore the related data
in TCGA databases, and to analyse the expression levels of key genes in
CRC tissues compared with normal tissues [[118]70].
Association of KGs with the immune infiltration levels in different cancers
including CRC
Tumor Immune Estimation Resource (TIMER) is an integrative resource for
investigating the molecular characterization of tumor-immune
interactions across various cancer types
([119]https://cistrome.shinyapps.io/timer/) [[120]71]. TIMER utilizes a
deconvolution statistical method to deduce the abundance of six
tumor-infiltrating immune cells, including B cells, CD4^+ T cells,
CD8^+ T cells, macrophages, neutrophils and DCs from The Cancer Genome
Atlas (TCGA).
DNA methylation of KGs
MethSurv is used to explore methylation biomarkers associated with the
survival of various human cancers [[121]72]. MethSurv is freely
available at [122]https://biit.cs.ut.ee/methsurv. Through the MethSurv
website, we will analyze the DNA methylation analysis of the selected
CRC-related genes in the TCGA database.
Association of KGs with different disease
The Disease-KGs enrichment analysis was performed using the Enrichr web
tool [[123]73] with DisGeNET database [[124]74] to explore other
disease risk factors for CRC patients.
Prognostic power analysis of KGs
To investigate the prognostic power of KGs, we performed cluster
analysis, survival analysis and developed two prediction models using
random forest (RF) and AdaBoost classifiers. The survival curve and ROC
curve were used to assess the prognosis performance. The online
SurvExpress computational tool [[125]75] was used to produce a survival
curve. The r-packages ‘gplots’ and ‘ROCR’ were used to produce heatmap
and ROC curve, respectively. Exploring drugs by molecular docking
simulation.
Exploring GO and KEGG pathway terms that are associated with DEGs including
KGs
The GO (Gene Ontology) functions [[126]76] and KEGG (Kyoto Encyclopedia
of Genes and Genomes) pathway enrichment analysis [[127]77] were
performed to explore CRC-causing ontology terms (Biological Process
(BP), Cellular Component (CC), and Molecular Function (MF)) and
pathways that are associated with cDEGs including KGs. To explore the
significantly enriched GO terms and KEGG pathways by cDEGs including
KGs, let S[i] is the annotated gene-set corresponding to the ith type
of biological functions or pathways given in the database, and M[i] is
the number of genes in S[i] (i = 1, 2,…,r); N is the total number of
annotated genes those construct the entire combine set
[MATH:
S=∪i=
mo>1rSi=Si∪Sic :MATH]
such that
[MATH: N≤∑i=1
rMi; :MATH]
where
[MATH: Sic
:MATH]
is the complement set of S[i]. Again, let n is the total number of
cDEGs of interest and k[i] is the number of cDEGs belonging to the
annotated gene-set S[i]. This problem is summarized by the following
contingency table (Table [128]2):
Table 2.
Contingency table
Annotated gene-sets (given in the GO terms or KEGG pathway databases)
cDEGs (proposed) CEEGs (proposed) Marginal total
ith GO term/KEGG pathway (S[i]) k[i] M[i] − k[i] M[i]
Complement of S[i] (
[MATH: Sic
:MATH]
) n − k[i] N − M[i] − n + k[i] N − M[i]
Marginal total n N − n N (Grand total)
[129]Open in a new tab
To find the significantly enriched GO terms and KEGG pathways by our
proposed cDEGs, the P value was calculated by the Fisher exact test
statistic based on the hypergeometric distribution. We used Enrichr
online tool to perform Fisher exact test [[130]78].
Regulatory network analysis of KGs
To identify key transcription factors (TFs) as the transcriptional
regulators of KGs, the TFs-KGs interaction network was constructed
using the publicly available database JASPAR [[131]79]. The interaction
network was generated using NetworkAnalyst [[132]80]. To identify key
microRNAs (miRNAs) as the post-transcriptional regulators of KGs, the
KGs-miRNAs interaction network was constructed by using the publicly
available online tool TarBase v8.0 (Release 7.0) [[133]81]. The top
degree miRNAs were selected from the networks (miRNAs-KGs) and
considered as key miRNAs.
Molecular docking simulation for exploring candidate drug agents
To explore efficient FDA approved repurposed drugs for the treatment of
CRC patients, we employed molecular docking simulation between the
target receptor proteins and drug agents. We considered our proposed
KGs based hub-proteins and associated TFs proteins as the drug target
receptor proteins and meta-drug agents collected from online databases
and published articles for docking analysis. The molecular docking
simulation requires 3-Dimensional (3D) structures of both receptor
proteins and candidate drugs. We downloaded the 3D structure of all
targeted receptor proteins from Protein Data Bank (PDB) [[134]82] and
SWISS-MODEL [[135]66]. The 3D structures of drug agents were downloaded
from the PubChem database [[136]83]. The 3D structure of the target
proteins was visualized using Discovery Studio Visualizer 2019
[[137]84], and the water molecules, co-crystal ligands which were bound
to the protein were removed. Further, the protein was prepared using
Swiss-PdbViewer [[138]85] and AutoDock Vina [[139]86] in PyRx
open-source software by adding charges and minimizing the energy of the
protein and subsequently converting it to pdbqt format [[140]86,
[141]87]. The exhaustiveness parameter was set to 8. The Discovery
Studio Visualizer 2019 was used to analyze the docked complexes for
surface complexes, types and distances of non-covalent bonds. Let A[ij]
denotes the binding affinity between ith target protein (i = 1, 2, …,
m) and jth drug agent (j = 1, 2, …, n). Then target proteins are
ordered according to the descending order of row sums
[MATH:
∑j=1nAij :MATH]
, j = 1, 2, …, m, and drug agents are ordered according to the
descending order of column sums
[MATH:
∑i=1mAij :MATH]
, j = 1,2, …, n, to select the top ranking few drug agents as the
candidate drugs. Then we validated the proposed repurposed drugs by
molecular docking simulation with the top ordered independent receptor
proteins associated with CRC published by others.
Results
Identification of cDEGs
We identified 50 cDEGs, including 19 up-regulated (Fig. [142]2A) and 31
down-regulated (Fig. [143]2B) genes in CRC tissue, using
adj.P.Val < 0.01 and logFC > 1 as the threshold for down-regulated
cDEGs, and adj.P.Val < 0.01 and logFC < -1 for up-regulated cDEGs. The
down and up regulated cDEGs were displayed on the right and left sides
respectively in the volcano plot (Fig. [144]3 and Additional file
[145]2).
Fig. 2.
[146]Fig. 2
[147]Open in a new tab
Common DEGs (cDEGs) among the five GEO datasets for A up-regulated and
B downregulated
Fig. 3.
[148]Fig. 3
[149]Open in a new tab
The five GEO datasets volcano plots of A [150]GSE110224, B
[151]GSE50760, C [152]GSE35279, D [153]GSE23878 and E [154]GSE9348. Ass
color point are Not Significant (NS) according to Log[2]FC and P value
threshold, green color is Log[2]FC (Log[2]FC < − 1 and Log[2]FC > 1),
blue color is P value ≤ 0.05, and red color points are satisfying the
Log[2]FC and P value threshold
Identification of key genes (KGs) from cDEGs
The PPI network of cDEGs was constructed using the STRING database,
which includes 49 nodes and 175 edges, with an average node degree of
6.73 and P value < 1.0e−16. In the PPI network, Red color indicates
up-regulated and black color indicates down-regulated cDEGs, big size
and octagon shape indicate common key genes (KGs) (Fig. [155]4). We
used four topological measures (Degree, BottleNeck, Betweenness, and
Stress) to select top-ranked 11 KGs (Table [156]3) that are CXCL8,
MMP7, CA4, ADH1C, GUCA2A, GUCA2B, CEMIP, ZG16, CLCA4, MS4A12 and CLDN1,
where 4 KGs (CXCL8, CEMIP, CLDN1, and MMP7) were up-regulated and the
rest 7 KGs were downregulated.
Fig. 4.
[157]Fig. 4
[158]Open in a new tab
Network of PPIs for common cDEGs that have been identified. Red color
nodes and upregulated and black color nodes are downregulated. The
outer circle of the image is common key genes (KGs)
Table 3.
Selection of KGs by combining the top ranked genes of five topological
measurements with the PPI network
Degree (A) BottleNeck (B) Betweenness (C) Stress (D) Key genes (KGs) (
[MATH: A∪B∪C∪D) :MATH]
GUCA2A CLDN1 CXCL8 CLDN1 GUCA2A, GUCA2B, CLDN1, CLCA4, MS4A12, MMP7,
CEMIP, CXCL8, ADH1C, ZG16, CA4
GUCA2B CXCL8 CLDN1 GUCA2A
CLDN1 CLCA4 GUCA2A GUCA2B
CLCA4 MMP7 MMP7 CXCL8
MS4A12 ZG16 CA4 CA4
[159]Open in a new tab
In-silico validation of CRC-causing KGs by using different web-tools and
independent databases
Expression analysis for KGs by GEPIA web-tool with TCGA RNA-seq data
In the GEPIA database, differences in transcriptional expression of the
hub gene between CRC tissues and normal tissues were again verified.
Combining with the box plot results, eleven potential KGs further were
screened out. Based on the GEPIA database to test the relative
expression of KGs mRNA, it was determined that our proposed KGs (CXCL8,
CEMIP, MMP7, CA4, ADH1C, GUCA2A, GUCA2B, ZG16, CLCA4, MS4A12 and CLDN1)
may be closely related to the occurrence and development of CRC
(Fig. [160]5).
Fig. 5.
[161]Fig. 5
[162]Open in a new tab
The expression level of hub genes in CRC. A ADH1C; B CA4; C CEMIP; D
CLCA4; E CLDN1; F CXCL8; G GUCA2A; H GUCA2B; I MMP7; J MS4A12 and K
ZG16. The red and gray boxes represent cancer and normal tissues,
respectively. Colon adenocarcinoma (COAD) and rectum adenocarcinoma
(READ)
Correlation between KGs and immune infiltration levels in different cancers
including CRC
We investigated the relationship of different tumors ininfiltrates
immune cell types (B cell, CD8 T cell, CD4 T cell, neutrophil,
macrophage and dendritic cell (DC)) with the expressions of KGs
(Additional file [163]3). We observed (Additional file [164]4) that our
proposed KGs are significantly associated with different tumor
infiltrates immune cells under different databases of COAD (colon
adenocarcinoma) and READ (Rectum adenocarcinoma). Compelling evidence
has demonstrated that tumor-infiltrating lymphocytes are significantly
associated with survival in cancer. Therefore, we investigated whether
KGs expression was related to immune infiltration levels in lung cancer
by TIMER. Tumor purity is an important factor affecting the analysis of
immune infiltration. Interestingly, our results indicated that KGs
expression was correlated with poor prognosis and high immune
infiltration in CRC. KGs were highly expressed in monocytes
(non-classical and classical) and B cells (naïve). In contrast, KGs
expression was not significantly correlated with tumor purity or
infiltrating levels of CD8^+ T cells, CD4^+ T cells or neutrophils in
CRC. CA4 expression levels were positively correlated with infiltrating
levels of B cells (r = 0.22, P = 2.47E−04), CD8^+ T cells (r = − 0.16,
P = 7.49E−03), CD4^+ T cells (r = 0.19, P = 1.99E−03), macrophages
(r = − 0.18, P = 2.19E−03), neutrophils (r = 0.38, P = 4.6E−11) and DCs
(r = 0.23, P = 9.73E−05) in COAD (Additional files [165]3 and [166]4).
CLCA4 expression levels were also positively correlated with
infiltrating levels of B cells (r = 0.32, P = 1.92E−03), CD8^+ T cells
(r = − 0.23, P = 2.95E−02), CD4^+ T cells (r = 0.39, P = 1.36E−04),
macrophages (r = − 0.48, P = 8.42E−07), neutrophils (r = 0.5,
P = 4.50E−07) and DCs (r = 0.33, P = 1.10E−03) in READ (Additional file
[167]3 and [168]4). These findings strongly suggest that KGs plays an
important role in immune infiltration in CRC, especially infiltration
of Macrophage, T cell CD8^+, T cell CD4^+, Neutrophil, Myeloid
dendritic cell, and B cell.
DNA methylation of KGs
DNA methylation at CpG (CG) sites play the vital role in cancer
progression. Therefore, we investigated DNA methylation of KGs (CXCL8,
CEMIP, MMP7, CA4, ADH1C, GUCA2A, GUCA2B, ZG16, CLCA4, MS4A12 and CLDN1)
at CpG sites by MethSurv web-tool with TCGA database. We observed that
seven KGs (CEMIP, MMP7, CA4, GUCA2B, ZG16, CLCA4, MS4A12) are
significantly methylated at CpG sites (Table [169]4). The
hypermethylation/downregulation gene CEMIP has six CpG sites with a P
value < 0.05, the hypomethylation/upregulation gene GUCA2B has four CpG
sites with a P value of < 0.05, the hypomethylation/upregulation gene
MS4A12 has two CpG sites with a P value < 0.05, and the
hypomethylation/upregulation gene MMP7, CLCA4, ZG16 has one CpG site
with a P value of < 0.05, which is statistically significant
(Table [170]4). We found that the difference in DNA methylation between
CG12358698 of CEMIP, CG23532119 of MS4A12, CG00656728 of GUCA2B,
CG24963041 of MMP7, CG26310643 of CLCA4, CG09229061 of ZG16, CG00200645
of CA4 and CG07510230 of ZNRF2 was most pronounced.
Table 4.
The significant prognostic value of CpG in three key genes
Gene-CpG HR P value
CEMIP-Body-Open_Sea-CG12358698 2.657 0.001
CEMIP-Body-Open_Sea-CG12098156 2.275 0.001
CEMIP-Body-Open_Sea-CG04847610 1.899 0.008
CEMIP-Body-Open_Sea-CG17820039 3.085 0.027
CEMIP-Body-Open_Sea-CG21838329 2.665 0.045
CEMIP-5'UTR-Open_Sea-CG09579081 3.836 0.019
MMP7-TSS1500-Open_Sea-cg24963041 1.822 0.016
MS4A12_5'UTR-Open_Sea-cg09257456 0.196 0.003
MS4A12_TSS200-Open_Sea-cg23532119 4.164 0.009
CLCA4-Body-Island-cg26310643 0.259 0.018
ZG16-TSS1500-Open_Sea-cg09229061 6.022 0.001
CA4-Body-Open_Sea-cg00200645 3.173 0.022
GUCA2B-TSS1500-Open_Sea-cg00656728 11.395 0.001
GUCA2B-TSS200-Open_Sea-cg10179693 3.585 0.009
GUCA2B-TSS200-Open_Sea-cg14848143 3.185 0.023
GUCA2B-1stExon-Open_Sea-cg19728577 7.457 0.001
[171]Open in a new tab
Association of KGs with different diseases including CRC
The disease-KGs interaction analysis showed that KGs are significantly
associated with different types of colon or rectal cancers including
Malignant tumor of colon, Colonic Neoplasms, Adenomatous Polyps,
Adenocarcinoma, Adenoma of large intestine, Colorectal Neoplasms,
Adenocarcinoma of colon, Colon Carcinoma, Stage III Colon Cancer AJCC
v7, Stage III Colon Cancer, Intestinal Neoplasms, Adenoma and
Metastatic Neoplasm (Fig. [172]6 and Table S2 in Additional file
[173]1).
Fig. 6.
Fig. 6
[174]Open in a new tab
KGs-Diseases interaction, where blue color highlighted risk factors are
CRC related
Prognostic power analysis
We considered both supervised and unsupervised learnings, including
multivariate survival analysis, to investigate the prognostic power of
KGs. Figure [175]7A shows that KGs can separate case and control
samples accurately by the unsupervised hierarchical clustering (HC).
The multivariate survival curves, based on the expressions of 11 KGs,
separated the low and high-risk groups significantly (Fig. [176]7B). In
the case of supervised learning, at first, we considered the expression
profiles of 11 KGs from three datasets ([177]GSE9348, [178]GSE23878 and
[179]GSE110224) that contained 60 tumors and 50 control samples in
total. Then we partitioned these datasets in to training (70%) and test
(30%) sets. Then we trained one popular classifier known as random
forest (RF). To test the prediction performance of the model, we also
considered the expressions of 11 KGs from another two dataset
[180]GSE35279 and TCGA as the independent test set. Figure [181]7C
showed the ROC curves based on the train, test performance, and
independent test dataset of RF prediction model. The AUC values (area
under the ROC curve) for RF were 1.00 with train data, 0.988 with test
data, 0.943 with independent test data and 0.90 with TCGA dataset.
Thus, both prediction models based on RF classifiers showed good
performance for each of the dependent and independent test datasets of
KGs.
Fig. 7.
[182]Fig. 7
[183]Open in a new tab
The prognostic powers of KGs were displayed by A a Heatmap of
hierarchical clustering, B multivariate survival curves with KGs, and C
ROC curves of prediction models with KGs
Exploring CRC-causing GO and KEGG pathway terms that are associated with
cDEGs including KGs
The GO functional enrichment analysis of showed that 185 GO-BP terms, 9
GO-CC terms and 38 GO-MF terms are enriched by the cDEGs genes, where
KGs were involved with 57 BPs, 6 CCs and 21 MFs. Among the enriched GO
functions including KGs, 6 GO-BP terms (GO:0034,31 ~ cell junction
maintenance, GO:0098742 ~ cell–cell adhesion via plasma-membrane
adhesion molecules, GO:0045216 ~ cell–cell junction organization,
GO:0008285 ~ negative regulation of cell population proliferation,
GO:0030334 ~ regulation of cell migration, and GO:0048565 ~ digestive
tract development), 5 GO-CC terms (GO:0046658 ~ anchored component of
plasma membrane, GO:0062023 ~ collagen-containing extracellular matrix,
GO:0005923 ~ bicellular tight junction, GO:0043296 ~ apical junction
complex, and GO:0005911 ~ cell–cell junction) and 6 GO-MF terms
(GO:0005179 ~ hormone activity, GO:0030250 ~ guanylate cyclase
activator activity, GO:0048018 ~ receptor ligand activity,
GO:0005254 ~ chloride channel activity, GO:0008237 ~ metallopeptidase
activity, and GO:0045236 ~ CXCR chemokine receptor binding) were
reported by other researchers as the BPs of CRC (see Table [184]3 and
discussion section for more details). The KEGG pathway enrichment
analysis of cDEGs showed that 8 pathways are enriched by the KGs. Among
them, KGs involving Nitrogen metabolism, Proximal tubule bicarbonate
reclamation, Cell adhesion molecules, Pathogenic Escherichia coli
infection, Human T-cell leukemia virus 1 infection, Amoebiasis,
Leukocyte transendothelial migration, and Cytokine-cytokine receptor
interaction was also reported by other researchers as the pathways of
CRC development (see Table [185]5 and discussion section for more
details as before).
Table 5.
Top Enriched gene ontology (GO) terms and KEGG pathways by the proposed
cDEGs highlighting cKGs
Term Overlap P value cKGs
Biological process
Cell junction maintenance (GO:0034331) [[186]88] 2/14 6.14E−04 CLDN1
Cell–cell adhesion via plasma-membrane adhesion molecules (GO:0098742)
[[187]89] 4/170 0.001067 CLDN1
Calcium-independent cell–cell adhesion via plasma membrane
cell-adhesion molecules (GO:0016338) [[188]89] 2/20 0.00127 CLDN1
Cell–cell junction organization (GO:0045216) [[189]89] 3/82 0.001342
CLDN1
Negative regulation of cell population proliferation (GO:0008285)
[[190]90] 5/379 0.003239 CXCL8
Regulation of cell migration (GO:0030334) [[191]91] 5/408 0.00443
MMP7;CLDN1
Molecular function
Hormone activity (GO:0005179) [[192]92] 5/78 1.96E−06 GUCA2A
Guanylate cyclase activator activity (GO:0030250) [[193]93] 2/5
6.85E−05 GUCA2B;GUCA2A
Receptor ligand activity (GO:0048018) [[194]94] 6/307 1.56E−04 GUCA2A
Chloride channel activity (GO:0005254) [[195]95] 2/64 0.012507 CLCA4
Metallopeptidase activity (GO:0008237) [[196]96] 2/121 0.040942 MMP7
CXCR chemokine receptor binding (GO:0045236) [[197]97] 1/17 0.044125
CXCL8
Cellular component
Anchored component of plasma membrane (GO:0046658) [[198]98] 2/46
0.006619 CA4
Collagen-containing extracellular matrix (GO:0062023) [[199]99] 4/380
0.018081 ZG16
Bicellular tight junction (GO:0005923) [[200]100] 2/78 0.018197 CLDN1
Apical junction complex (GO:0043296) [[201]101] 2/98 0.027852 CLDN1
Cell–cell junction (GO:0005911) [[202]102] 3/271 0.035073 CLDN1
KEGG
Nitrogen metabolism [[203]22] 1/7 9.13E−04 CA4
Proximal tubule bicarbonate reclamation [[204]103] 1/23 0.001682 CA4
Cell adhesion molecules [[205]104] 3/148 0.007098 CLDN1
Pathogenic Escherichia coli infection [[206]105] 3/197 0.015369
CXCL8;CLDN1
Amoebiasis [[207]44] 2/102 0.029984 CXCL8
Leukocyte transendothelial migration [[208]106] 2/114 0.036749 CLDN1
Cytokine-cytokine receptor interaction [[209]107] 3/295 0.043339 CXCL8
[210]Open in a new tab
Regulatory network analysis of KGs
We constructed KGs versus transcription factors (KGs-TFs) interaction
network to identify top ranking few TFs as the key transcriptional
regulators of KGs. We selected the top 4 key TFs (FOXC1, YY1, GATA2 and
NFKB1) as the vital transcriptional regulators of KGs with degree ≥ 4,
where the green color rectangle indicates top degree key TFs and, red
and black color ellipse indicates KGs (Fig. [211]8A). To identify top
ranking few micro-RNA (miRNA) as the key post-transcriptional
regulators of KGs, we constructed a KGs-miRNAs interaction network. We
selected the top 8 key miRNAs (hsa-mir-16-5p, hsa-mir-195-5p,
hsa-mir-203a-3p, hsa-mir-34a-5p, hsa-mir-107, hsa-mir-27a-3p,
hsa-mir-429, and hsa-mir-335-5p) as the vital regulators of KGs with
degree ≥ 4, where green color rectangle indicates top degree key miRNAs
and, red and black color ellipse indicates KGs (Fig. [212]8B).
Fig. 8.
[213]Fig. 8
[214]Open in a new tab
KGs regulatory network analysis results A KGs-TFs interaction network
to identify key transcriptional regulators of KGs, B KGs-miRNAs
interaction network to identify key post-transcriptional regulators of
KGs. Here red and black color ellipse indicates the KGs in both A and
B, green color bigger size rectangle indicates key TFs in A and key
miRNAs in B
Exploring candidate drug agents by molecular docking simulation
To explore candidate drugs for CRC, we considered 11 KGs based proteins
(CXCL8, MMP7, CA4, ADH1C, GUCA2A, GUCA2B, CEMIP, ZG16, CLCA4, MS4A12
and CLDN1) and its regulatory key 4 TFs proteins (FOXC1, YY1, GATA2 and
NFKB1) as the m = 15 drug target receptors. The 3-Dimension (3D)
structure of CXCL8, MMP7, ZG16, CA4, YY1 and NFKB1 were downloaded from
Protein Data Bank (PDB) with the PDB codes 6N2U, 1MMQ, 3APA, 5KU6, 4C5I
and 1NFI and the rest of them, such as GUCA2A, GUCA2B, CLDN1, CLCA4,
MS4A12, FOXC1 and GATA2 targets were downloaded from SWISS-MODEL using
UniProt with IDs [215]Q02747, [216]Q16661, [217]O95832, [218]Q14CN2,
[219]Q9NXJ0, [220]Q12948, and [221]P23769 respectively. Then we
considered 92 meta-drug molecules from the DSigDB database and 64
meta-drugs from the published articles and the Food and Drug
Administration (FDA) as drug agents. The 3D structures of drug agents
were downloaded from the PubChem database. Then we performed a
molecular docking simulation between our proposed receptors and
meta-drug agents. The binding affinity score matrix between the ordered
receptors and ordered drug agents were displayed in Fig. [222]9A. We
observed that Cyclosporin A produces highly significant binding
affinity scores with all m = 15 target proteins, and their average
binding affinity scores across all targets were − 9.46 (kcal/mol). The
2th and 3th top ordered drugs (Manzamine A and Cardidigin) produced
highly significant binding affinity scores with 14 target proteins, and
their average binding affinity scores across all m = 15 targets were
− 8.22 and − 8.19, respectively. The 4th to 10th top ordered drug
Staurosporine, Benzo[A] Pyrene, Sitosterol, Nocardiopsis Sp,
Troglitazone, K-252a, and Riccardin D produced significant binding
affinity scores with 14 target proteins, and the average binding
affinity score was − 7.76, − 7.71, − 7.69, − 7.68, − 7.66, − 7.64, and
− 7.62 respectively. The other drugs (lead compounds) produced
significant binding affinity scores with less than 13 target proteins
out of 15, and their average binding affinity scores were negatively
smaller than − 7.5. Therefore, we considered the top ordered nine drugs
(Cyclosporin A, Manzamine A, Cardidigin, Staurosporine, Benzo[A]Pyrene,
Sitosterol, Nocardiopsis Sp, Troglitazone and Riccardin D) as the
candidate drugs in our study. We also examined their complete
interaction profile, including hydrogen bonds, hydrophobic, halogen/
salt Bridge and electrostatic interactions in Table [223]6.
Fig. 9.
[224]Fig. 9
[225]Open in a new tab
Molecular docking simulation results for exploring candidate drugs
against CRC. A Image of binding affinity scores of proposed ordered
receptor proteins with the top 50 ordered. B Image of binding affinity
scores of the top-ranked independent receptors published by others with
the top 50 ordered
Table 6.
The 3-dimension view of strong binding interactions between targets and
drugs is shown in the 4th column
Potential targets Structure of lead compounds Binding affinity
(kCal/mol) The 3d view and interactions of complex Interacting amino
acids
Hydrogen bond Hydrophobic interactions Electrostatic
MMP7
Cardidigin
graphic file with name 12920_2023_1488_Figa_HTML.gif
− 10.4 graphic file with name 12920_2023_1488_Figb_HTML.gif LEU181,
THR189, ASN179, GLU219 HIS218, PHE103, PHE185 –
CA4
Manzamine A
graphic file with name 12920_2023_1488_Figc_HTML.gif
− 10.8 graphic file with name 12920_2023_1488_Figd_HTML.gif HIS4,
TYR11, HIS4 HIS4, HIS4 –
ADH1C
Cyclosporin A
graphic file with name 12920_2023_1488_Fige_HTML.gif
− 11.7 graphic file with name 12920_2023_1488_Figf_HTML.gif ALA317
LEU116, ILE318, PHE93 –
[226]Open in a new tab
Key interactions amino acids and their binding types with potential
targets were shown in the last column
Performance investigation of proposed drugs by cross-validation with the
top-ranked independent receptors
To investigate the resistance performance of our proposed 9 candidate
drugs against the state-of-the-art alternative receptors for CRC by
molecular docking, we considered the top-ranked 8 independent receptors
(MYC, CDK1, CXCL1, CXCL8, CXCL12, TIMP1, AURKA, and TOP2A) published by
others in different 36 articles for CRC (see Additional file [227]1:
Table S1), where the receptor CXCL8 was common with our proposed
receptor. The 3D structure of MYC, CDK1, CXCL12, TIMP1, AURKA, and
TOP2A was downloaded from the PDB database with the PDB codes 6G6K,
6GU2, 6SHR, 2J0T, 6VPM, AND 1ZXM, respectively and for another one
CXCL1, downloaded from SWISS-MODEL using UniProt with ID [228]P09341.
The we performed molecular docking analysis of 7 independent receptors
with all of 160 drug agents. Figure [229]9B showed that our proposed 9
candidate drugs are also detected as the independent receptor-guided
top-ranked 9 drugs. Therefore, we can strongly recommend that the
proposed drugs might be more effective candidates than the other drugs
for the treatment against CRC.
Connectivity map (CMap) analysis to discover the mechanism of action of drug
agents
In an effort to elucidate its mechanism of action, we defined a
signature for Troglitazone, Cardidigin and Staurosporine. High
connectivity scores were found for multiple instances of five heat
shock protein inhibitors: Angiotensin receptor antagonist,
Topoisomerase inhibitor, Glycogen synthase kinase inhibitor, DNA
dependent protein kinase inhibitor, and MTOR inhibitor. Despite the
differences in the cells used to generate the query signature and
reference profiles, the three highest-scoring compounds in the Con
nectivity Map were Troglitazone, Cardidigin (Digitoxin use this name to
use in Cmap) and Staurosporine (Fig. [230]10A). More important, the
Connectivity Map also revealed strong connectivity with ten
structurally distinct compounds, mocetinostat, ryuvidine, cyclopamine,
dorsomorphin, JNJ-7706621, quinoclamine, SU-11652, bisacodyl,
alvocidib, and rottlerin respective inhibitor are show in
Fig. [231]10B. Cyclopamine and alvocidib compounds are not connected
with Troglitazone.
Fig. 10.
[232]Fig. 10
[233]Open in a new tab
The Connectivity Map for three smalls molecules of Troglitazone,
Cardidigin and Staurosporine
AK3, OAZ2, NDUFAF4, EGFR, GNE, MAPK14, CSNK1G2, HSP90AB1, ZNF449, and
GATAD1 genes depicts high positive connectivity with each of the drugs
Troglitazone, Cardidigin and Staurosporine and their median
connectivity score belongs to 97.96–97.53 (out of ± 100) which display
in the Fig. [234]10C with corresponding enriched pathways of the
connected gene. Moreover, the drug staurosporine was positively
connected with 6 other genes namely OTUD3, TCF7L1, C2, TAAR1, PRDM1,
and BMP2 with corresponding enriched pathways.
Discussion
The molecular mechanism of colorectal cancer (CRC) is not yet
completely clear to the researchers. So potential molecular signatures
are required to disclose molecular mechanisms of CRC and its
therapeutic agents. The integrated statistics and bioinformatics
analyses are now widely using to explore potential molecular signatures
of malignant tumors [[235]108]. Transcriptomics analysis is a popular
way of identifying DEGs between normal and tumor tissue samples
[[236]109]. Therefore, in this study, we considered the integrated
bioinformatics analyses for exploring common genomic biomarkers from
five transcriptomics profiles ([237]GSE9348, [238]GSE35279,
[239]GSE23878, [240]GSE110224 and [241]GSE50760) for diagnosis,
prognosis, and therapies of CRC. At first, we identified 11 KGs (CXCL8,
MMP7, CA4, ADH1C, GUCA2A, GUCA2B, CEMIP, ZG16, CLCA4, MS4A12, and
CLDN1) by PPI analysis of 50 common DEGs (cDEGs). Some literature
reviews also agreed with our results that these KGs are associated with
CRC [[242]14–[243]45, [244]54] (see Fig. [245]11A). For example, Li et
al. [[246]110] reported that the gene CXCL8 plays a vital role in CRC
progression by mediating the differentiation, proliferation, and
apoptosis within a regulatory network. So, they suggested this gene as
a drug target for CRC also [[247]110]. Chen and Ke [[248]111] detected
the gene MMP7 as a potential biomarker of CRC by bioinformatics
analysis. Another study found that it regulates cancer progression and
mediate the differentiation, proliferation, invasion and metastasis of
various cancer cell types by different mechanisms [[249]112]. A study
reported that CA4 is a newly identified tumor suppressor gene in CRC by
targeting the WTAP–WT1–TBL1 axis through the inhibition of the Wnt
signaling pathway [[250]113]. The gene ADH1C might lead the increasing
production of proinflammatory mediators by decreasing its expressions
in the ulcerative colitis colon through the activation of the
STAT1/NF-κB signaling pathway [[251]114].
Fig. 11.
[252]Fig. 11
[253]Open in a new tab
Validation of the proposed KGs (receptors) and candidate drugs in favor
of CRC by the literature review. A Validation of the proposed KGs:
circles with green color indicate downregulated KGs, and pink color
indicates up-regulated KGs, and each connected network with a circle
indicates the reference in which the cKG is associated with CRC, B
validation of the proposed candidate drugs: circles with green color
indicate FDA approved and investigational drugs, purple color indicate
investigational drugs and red color indicate unapproved drugs and each
connected network with a circle indicates the references in which our