Abstract
Microarray analysis based on gene coexpression is widely used to
investigate the coregulation pattern of a group (or cluster) of genes
in a specific phenotype condition. Recent approaches go one step beyond
and look for differential coexpression pattern, wherein there exists a
significant difference in coexpression pattern between two phenotype
conditions. These changes of coexpression patterns generally arise due
to significant change in regulatory mechanism across different
conditions governed by natural progression of diseases. Here we develop
a novel multiobjective framework DiffCoMO, to identify differentially
coexpressed modules that capture altered coexpression in gene modules
across different stages of HIV-1 progression. The objectives are built
to emphasize the distance between coexpression pattern of two phenotype
stages. The proposed method is assessed by comparing with some
state-of-the-art techniques. We show that DiffCoMO outperforms the
state-of-the-art for detecting differential coexpressed modules.
Moreover, we have compared the performance of all the methods using
simulated data. The biological significance of the discovered modules
is also investigated using GO and pathway enrichment analysis.
Additionally, miRNA enrichment analysis is carried out to identify TF
to miRNA and miRNA to TF connections. The gene modules discovered by
DiffCoMO manifest regulation by miRNA-28, miRNA-29 and miRNA-125
families.
Introduction
The typical course of HIV infection starts with a acute human
immunodeficiency virus (HIV) infection, also known as primary HIV
infection or acute retroviral syndrome^[26]1, [27]2. In this stage
large amount of viruses are produced in human body and the amounts of
CD4+ cells begins to drop downward^[28]3. Subsequently the immune
response will retrieve the viral load to a certain level, called ‘viral
set point’ which is a marginally stable state of HIV virus load in
human body. A small proportion of HIV infected individual remain
clinically stable for a long periods and have been referred as
long-term nonprogressors^[29]3. In most of the situation the acute
phase of infection progresses to the latent phase (chronic) accompanied
with markedly diminishing CD4+ cell count. This results constitutional
symptoms of HIV in human body that ultimately leads to AIDS.
Microarray based coexpression network analysis has been demonstrated to
be an emerging field to investigate the coregulation pattern in gene
regulation. This also offers an opportunity for discovering the
differences in gene expression pattern across different stages of
disease progression. Several studies have been proposed to reveal the
changes in expression profile by introducing differential expression in
different stages of HIV infection^[30]4, [31]5. In Hyrcza et al. ^[32]6
an analysis is carried out to detect differential regulation pattern of
genes across different stages of early and chronic progression of
HIV-1. They also demonstrated that expression of interferon stimulated
genes is significantly increased across these stages. In a similar
study Li et al. ^[33]7 proposed a framework to investigate stage
specific expression pattern during HIV-1 infection by utilizing
differential expression analysis of lymphatic tissue microarrays. In a
recent study^[34]8 topological characteristics of coexpression modules
are investigated through HIV infection stages. But all these studies
are confined in investigating the extent of differences in expression
patterns over different stages of HIV progression. Recently
coexpression and differential coexpression analysis have been
demonstrated to be powerful methods for exploring coexpression patterns
and investigating intrinsic relationships between a set of coexpressed
genes (or clusters) across different conditions of a specific disease.
Co-expression analysis deals with the degree of coexpression under a
certain condition whereas differential coexpression analysis looks for
the differences in coexpression of a set of gene pairs or gene clusters
across different experimental conditions^[35]9–[36]11. As modules are
the best representative of any network, a proper analysis of
differentially coexpressed modules may reveal significant changes in
coregulation pattern across different stages of disease progression.
Several studies have been developed to study the change in coexpression
patterns across normal and disease states. Computational methodologies
have been developed to investigate differential coexpression (DC)
patterns by finding differentially coexpressed gene pairs or set of
gene clusters (modules) that show a significant change in coexpression
between normal and disease states^[37]9, [38]12–[39]15. Other
approaches seek to identify clusters in one condition and test whether
these modules show change in coexpression patterns in a different
condition. For example CoXpress^[40]12 utilizes hierarchical clustering
to explore the relationship between genes. The gene modules are defined
by cutting the tree at specific level. CoXpress then uses a resampling
technique to identify those groups which are coexpressed in one set of
experiment but not in other. Another approach called DiffCoex^[41]13 is
developed to identify DC modules across two different conditions. To
quantify DC, DiffCoex utilizes a statistical framework. DiffCoex
transforms the DC scores into dissimilarity measures and exploits a
popularly used tool WGCNA (Weighted Gene Coexpression Network
Analysis)^[42]16 for grouping the DC modules. A recently proposed
method called DICER (Differential Correlation in Expression for
meta-module Recovery)^[43]5 aims to identify gene sets whose
correlation patterns differs markedly between disease and control
samples. Dicer goes beyond the convenient approaches of differentially
coexpressed module detection and identifies differentially coexpressed
meta-modules which basically represents a class of modules where each
pair of modules shows a significant change in coexpression patterns,
but retains same coexpression patterns within each module.
Most of the methods are mainly used some scoring function to capture
the differential coexpression and employed some searching technique to
optimize it. Here the differentially coexpressed module detection
problem is modeled as a multiobjective optimization problem, optimizing
different criteria simultaneously. Multiobjective modeling of a problem
is useful as it finds solution by optimizing different criteria
simultaneously. For example, our earlier approach^[44]17 was to develop
a multiobjective model to detect coexpression module by optimizing two
cluster validity index simultaneously. In this article, we develop
DiffCoMO (Differentially COexpressed module detection using
Multi-Objective algorithm) for identifying differentially coexpressed
modules across two different conditions of disease progression. The
coexpression patterns of human CD4+ and CD8+ T cells from HIV infected
individuals at two clinical stages of HIV progression viz., acute and
chronic are considered for this purpose.
DiffCoMO operates on two objective functions. The first one maximizes
modulewise distance between two coexpression networks of two infection
stages. The second one maximizes absolute difference of module
membership value of a gene within a module across two stages. The two
objectives equally contribute for obtaining the Pareto optimal
solutions. For identifying similarity between two gene expression
profile Pearson correlation coefficient is utilized here. The
significance of the identified modules are assessed by applying some
statistical test based on the extent of Differential Coexpression score
(DC_Score). We compare our method with other state-of-the-art
techniques like DiffCoex^[45]13, CLICK^[46]18 and CoXpress^[47]12, and
DICER^[48]5. We demonstrate that our method improves upon the
state-of-the-art in detecting differential coexpression patterns more
accurately. For biological validation we survey the GO terms and KEGG
pathway enrichment of the detected differentially coexpressed modules
to identify some key functionality that play important role in
progression of infection from acute to chronic stage.
Methods
In this section we describe the proposed multiobjective framework for
identifying differentially coexpressed modules across two different
stages of HIV infection. The following sections describe construction
of differential coexpression network and a brief description of
objective functions to detect modules in this network.
Differential coexpression of gene in two phenotypes
In gene coexpression network each gene corresponds to a node and the
correlation between each pair of genes corresponds to an edge between
nodes. Correlation is generally designated as a good measure that is
able to capture positive and negative dependence simultaneously between
a pair of variables. So, to determine similarity between two gene
expression profiles Pearson correlation coefficient is generally used.
So, gene coexpression network corresponds to a of similarity matrix
which encodes the connection strength between two pair of genes^[49]16.
Some adjacency functions are useful to perform soft or hard
thresholding on similarity matrix for building an adjacency matrix
which corresponds to unweighted or weighted network respectively. From
a network perspective the coexpression network can be formally defined
as: G = (V, E, W), where V represents nodes (genes) E represents edges
corresponding to a pair of nodes and W is an weight function defined as
W : E → [−1, 1], which represents the correlation between the pair of
nodes.
Differential coexpression generally deals with the change in
coexpression patterns across two different phenotypes. To deals with
it, one needs to define a measure that approximately reflects
coexpression change of a gene pair. A simple way to compute it is to
calculate the absolute difference of similarity values across two
phenotypes. This can be formally stated as:
[MATH:
DCi,jp1,p2<
/mrow>=|Sim<
mo
stretchy="true">(xi,xj
)p1−Sim(xi,xj
)p2
mrow>|, :MATH]
1
where p [1], p [2] are two different phenotype conditions, and x [i], x
[j] represent expression profile of gene [i] and gene [j] respectively.
Here Sim(x [i], x [j])^p signifies Pearson correlation between x [i]
and x [j] for phenotype p.
Following the same convention we can define differential coexpression
score or DC_Score of a module by averaging the DC values between all
pair of genes within it. The module-wise average DC_score can be
defined as
[MATH:
DC_Scor<
msub>eMi=∑<
/mo>i,jεMiDCi,jp1,p2N, :MATH]
2
where
[MATH:
DCi,jp1,p2<
/mrow> :MATH]
represents differential coexpression score between gene i and gene j in
module M [i] across two phenotypes p1 and p2, and N represent number of
genes in the module. Large value of this metric corresponds to modules
having high differential coexpression.
Here, we have built two coexpression networks by considering expression
profile of genes in acute and chronic stage of infection. A
differential coexpression network is also constructed from these
expression datasets.
The DiffCoMO framework
DiffCoMO is totally based on the GA based multiobjective framework. It
utilizes the popular Non-dominated Sorting GA (NSGA-II) for
constructing the multiobjective framework. The multi-objective
optimization problem can be formally defined as follows^[50]19: find
the vector
[MATH: x¯⁎=
[x1
⁎,<
mi>x2⁎,…,xn⁎]T :MATH]
of the decision variables satisfying the m inequality constraints:
[MATH: gi(x¯)≥0,i=1,2,…,m :MATH]
, and p equality constraints
[MATH: hi(x¯)=0,i=1,2,…,p :MATH]
, that optimizes the vector function
[MATH: f¯(x¯)=[f1(x¯),f2(x¯),…,f<
/mi>k(x¯)]T :MATH]
. The constraints define the feasible region F containing all the
admissible solutions. The vector
[MATH: x¯⁎
:MATH]
denotes an optimal solution in F.
The concept of Pareto optimality ^[51]19 is useful in the domain of
multi-objective optimization. From the viewpoint of minimization
problem the definition of Pareto optimality may be given as follows: A
decision vector
[MATH: x¯⁎
:MATH]
is called Pareto optimal if and only if there is no
[MATH: x¯ :MATH]
that dominates
[MATH: x¯⁎
:MATH]
, i.e., there is no
[MATH: x¯ :MATH]
such that
[MATH: ∀i∈{1,2,…,k},fi(x¯)≤fi(x¯⁎) :MATH]
and
[MATH: ∃i∈{1,2,…,k},fi(x¯)<fi(x¯⁎) :MATH]
. In other words,
[MATH: x¯⁎
:MATH]
is Pareto optimal if there exists no feasible vector
[MATH: x¯ :MATH]
which causes a reduction on some criterion without a simultaneous
increase in at least one another. Pareto optimum usually encompasses a
set of solutions generally called non-dominated solutions.
For building up the chromosome structure DiffCoMO encodes a set of
nodes as a chromosome. For example, a chromosome p [1], p [2] … p [n]
where p [i] is an integer denoting the index of a gene (or node),
represents a differentially coexpressed gene modules. The edge weights
between these nodes represent differential coexpression score. Here,
the population size is taken as 50 and the algorithm runs for 200
generations. All the Pareto optimal solutions obtained in the final
generation are taken as resulting modules.
For starting from a reasonable position we construct the initial
population as a set of modules with high DC scores. For this purpose,
we first dichotomize the differential coexpression matrix by choosing a
threshold of 0.5. Next, we randomly chose some substructure consisting
of all 1s from this resulting binary matrix. To find out all 1s
substructures we apply a biclustering technique^[52]20 and randomly
pick up some of the biclusters consisting of all 1s. Here, union of
rows and columns of each bicluster is treated as a chromosome, which
comprise the initial population.
Representation of objective functions
To construct the objective functions we have emphasized the change of
coexpression pattern of a pair of genes across the acute and chronic
stages of HIV infection. Here, we have two correlation matrices which
represent the expression similarity among each pair of gene expression
profiles across the two infection stages.
For constructing the first objective function we have computed the
distance between two correlation matrices. Let, M [i](pk) represent
correlation matrix of module M [i] at stage pk. Consider the inner
product between two matrices M [i](p1) and M [j](p2) of infection
stages p1 and p2 which satisfies:
[MATH: 〈Mi(p1),Mi(p2)〉=tr{Mi(p1)Mi(p2)}≤‖Mi(p1)‖F<
/msub>‖Mj
mi>(p2)‖F.
:MATH]
3
For two infection stages p1 and p2 the distance metric is defined as
[MATH:
f1=DistMi,Mj=1−tr{Mi(p1)}tr{Mj(p2)}‖<
mrow>Mi(p1)‖F<
/msub>‖Mj
mi>(p2)‖F, :MATH]
4
where tr. represent the trace operator and ||.||[F] represents
Frobenius norm. The metric
[MATH:
Dist
Mi,Mj
:MATH]
represents the distance between two correlation matrices M [i] and M
[j]. It becomes zero when the two correlation matrices are equal, and
is equal to one if the matrices differ maximum amount. The distance is
motivated from^[53]21, where the metric is used to characterize the
change of spatial structure of multiple-input multiple-output (MIMO)
channels.
For building the second objective we utilize module eigenegene^[54]16
based measure. Module Eigenegene is generally considered as a
representative of the whole module. It is defined as the first left
singular vector of the expression matrix corresponding to the module.
Suppose X ^(q) represent gene expression data corresponding to the
module q. The singular value decomposition of X ^(q) (m × n) provides
three matrices U, V and D as follows: X ^(q) = UDV ^T where U(u [1], u
[2], … u [min(m,n)]) and V(v [1], v [2], … v [min(m,n)]) are the
matrices of left and right singular vectors and D(d [1], d [2], … d
[min(m,n)]) is a diagonal matrix containing singular values. The first
column of U ^(q) (i.e., u1^(q)) is referred to as module eigengene.
Module eigenegene generally explains the highest amount of variation in
the module expression data. The Pearson correlation between expression
profile of a gene and a module eigengene is designated as the module
membership value of that gene. For a module, the module membership
values of all genes are calculated for two infection stages. Let,
[MATH:
MM_gimip :MATH]
represents the module membership value of a gene g [i] of module m [i]
at infection stage p. For each of the module we have computed this
metric for two infection stages p1 and p2. Particularly, for each gene
in a module we have computed the following metric:
[MATH:
diff_MM<
mi>_gip1,p2=|MM_gimip1−MM_<
mrow>gim
ip2| :MATH]
5
This signifies the absolute difference between module membership value
of a gene in two different infection stages. We take the average of all
these values inside a module. Thus our second objective function is
defined as
[MATH:
f2=∑Mi∈M
∑gi
∈Mi
mrow>diff_MM_gi<
/mi>p1,p2N, :MATH]
6
where, M is the module set, and N represent number of modules.
Testing of objective functions
To test the objective functions are capable for detecting the
differential coexpression pattern we have performed an analysis. For
each of the metric we have investigated the distribution of DC_Score
which are believed to be higher for differential coexpression modules.
Figure [55]1 shows the distribution of objective function values with
DC_Score. Here we see that there exist a strong correlation between
f1(or f2) with DC_Score. The way we define the first objective f1 is
very straightforward. For a module, it compute the distance between two
correlation matrices which comes from two different infection stages.
Note that f1 actually mimics the value of DC_Score in a different way.
Figure [56]1 also proves the same statement. This is also true for the
other objective function f2. Module membership of a gene in a module
characterizes the overall expression similarity with the other genes
within the module. For a gene, if this value differ significantly
across two infection stages, it actually reflects the change of
expression similarity with the other genes within the module. The
increasing value of f2 with DC_Score also proves this statement.
Figure 1.
Figure 1
[57]Open in a new tab
Figure shows the plot of DC_Score vs objective function values. Values
of both objective functions increase with the DC_Score values.
Results
In this section we describe the performance of our proposed method with
that of some state-of-the-art technique like DiffCoEx, CoXpress and
CLICK.
Description of Dataset
We downloaded the series [58]GSE6740 dataset from GEO database
([59]http://www.ncbi.nlm.nih.gov/geo/) for HIV-1 expression data. The
dataset consist stage specific gene expression of human CD4+ and CD8+ T
cells. It is prepared by examining the gene expression profiles in ex
vivo human CD4+ and CD8+ T cells from untreated HIV-infected
individuals at different clinical stages of disease progression^[60]6.
The dataset consists of 22284 genes with 10 samples in each stage. The
platform used for the analysis is Affymetrix Human Genome U133A Array.
Some genes are represented by multiple probe sets. Also the other probe
sets are not fully annotated, so for consistency we refer to probe sets
as “genes” throughout the article. A gene is considered expressed if it
is called as “present” or “marginal” in all of the samples in the given
dataset. This is determined by “mas5calls” function (Bioconductor
“affy” package) in R. It performs the Wilcoxon signed rank-based gene
expression presence/absence detection call to determine whether the
transcript of a gene is detected (present) or undetected (absent). This
results 3829 and 3704 expressed genes in acute and chronic stage,
respectively. We select 2828 common genes among the expressed genes for
our analysis. The scenario is shown in Fig. [61]2.
Figure 2.
Figure 2
[62]Open in a new tab
Venn diagram showing the overlaps of expressed genes in acute and
chornic stages. 2828 genes are common between 3829 expressed genes of
acute stage and 3704 expressed genes of chronic stage.
It can be noted that each stage consists of ten samples, five samples
are for CD4+ T cells and five samples are for CD8+ T cells. To know
whether there is any difference in the expression patterns of CD4+ and
CD8+ T cells in each stage, we perform an analysis. First, we separate
the CD4+ and CD8+ samples of acute and chronic stage. For each stage,
we plot the distribution of mean expression values of all expressed
genes in CD4+ and CD8+ samples. Figure [63]3 (panel-(a) and panel(b))
shows the distribution. Moreover, we investigate the distribution of
all CD4+ and CD8+ samples. Figures [64]4 and [65]5 show the density
plot and box plots of all the samples for acute stage and chronic stage
respectively. It appears from the figures that there is no such
distinguishable change of expression patterns between CD4+ and CD8+
samples in acute and chronic stages.
Figure 3.
[66]Figure 3
[67]Open in a new tab
Figure shows the distribution of mean expression values of all
expressed genes in CD4+ and CD8+ samples in acute and chronic stages.
Figure 4.
[68]Figure 4
[69]Open in a new tab
Comparison of expression values of CD4+ and CD8+ samples in acute
stage. Panel (a) shows the distribution of five samples of CD4+ cell
([70]GSM154936, [71]GSM155180, [72]GSM155182, [73]GSM155184 and
[74]GSM155186) and five sample of CD8+ cell ([75]GSM155179,
[76]GSM155181, [77]GSM155183, [78]GSM155185 and [79]GSM155187). Panel
(b) shows the boxplots of the corresponding samples.
Figure 5.
[80]Figure 5
[81]Open in a new tab
Comparison of expression values of CD4+ and CD8+ samples in chronic
stage. Panel (a) shows the distribution of five samples of CD4+ cell
([82]GSM155189, [83]GSM155192, [84]GSM155200, [85]GSM155202 and
[86]GSM155204) and five sample of CD8+ cell ([87]GSM155190,
[88]GSM155195, [89]GSM155201, [90]GSM155203 and [91]GSM155206). Panel
(b) shows the boxplots of the corresponding samples.
Comparing DiffCoMO with some state-of-the-art
Here, we compare the performance of our proposed method with that of
some other method like CLICK, DiffCoex and CoXpress. To assess the
performance of all the methods and conduct a fair comparison we compare
the ability of the methods to detect Differential coexpression pattern
within each modules.
We applied CLICK, CoXpress, DiffCoex and proposed method DiffCoMO on
the acute and chronic stage of HIV-1 expression dataset and find set of
differentially coexpressed modules in each case. We inspect absolute
change in correlation of each detected modules. For this purpose we
calculate DC_Score of identified modules of all the methods and plot
these in Fig. [92]6(a,b). The left panel of the figure shows the
distribution of DC_Score with value 0.35 or above with the fraction of
identified modules for DiffCoex, CoXpress, CLICK, DICER and the
proposed method DiffCoMO. The right panel shows the distribution of
DC_Score for the five state-of-the-arts.
Figure 6.
[93]Figure 6
[94]Open in a new tab
Distribution of DC_Score - the left pane show the fraction of
identified modules having DC_Score score above 0.35, while the right
pane show the distribution of DC_Score for DiffCoex, CoXpress, CLICK
and the proposed DiffCoMO algorithm.
From the Fig. [95]6 panel (a) it is clear that DiffCoMO has higher
proportion of modules having better DC_Score. Moreover, from Fig. [96]6
panel (b) it can be also noticed that all the modules in DiffCoMO have
DC_Score comparatively better than the other competing methods.
The metric DC_Score represents average DC_values between all pair of
genes within a module. To know the extent of differential coexpression
between all pair of genes within each module we plot the distribution
of DC_values of all the genes. Here, we have compared the proposed
technique with the method proposed in ref. [97]22. By following the
same methodology proposed in ref. [98]22 we compute differentially
coexpressed gene pairs across acute and chronic stage of infection. We
plot the distribution of DC_values of all the gene pairs obtained from
the analysis and compared this with DiffCoMO. Figure [99]7 shows the
distributions. It is evident form the figure that DiffCoMO also has the
capability to detect differentially coexpressed gene pairs across two
infection stages of disease progression.
Figure 7.
Figure 7
[100]Open in a new tab
The upper pane of the figure shows the distribution of DC_values of all
the gene pairs belonging to the differentially coexpressed modules
obtained from DiffCoMO. The lower pane shows the same for the method
proposed in ref. [101]22.
Identified modules are statistical significant
To investigate whether the difference in correlation pattern in
identified modules is statistically significant, we have performed a
statistical test. For this, we have used the Jennrich test^[102]23
which compares equality of two matrices. It identifies the differences
between two correlation matrices to the averages of these two using a
chi square test. Using this test we have compared two correlation
matrices corresponding to acute and chronic stage, of differential
coexpression modules. The distribution of p-values for different
algorithms are shown in Fig. [103]8. It appears from the figure that
the DiffCoMO modules show lower p-values (higher −log(p)) than the
other state-of-the-art. Almost all the modules of DiffCoMO have p-value
less than 0.01 which proves that the difference in correlation pattern
is identified effectively.
Figure 8.
Figure 8
[104]Open in a new tab
Figure shows the distribution of p-values obtained from Jennrich test,
for DiffCoex, CoXpress, CLICK and the proposed DiffCoMO algorithm. For
better visualization we plot the distribution of −log(p) values
obtained from different methods.
Measuring performance in a simulated dataset
We use simulated data to evaluate the performance and investigate the
correctness of our algorithm to capture the differential coexpression
modules. For this purpose, we have created 5 sets of different
coexpression networks from the original set, where each set contains
two networks corresponding to two different stages. The networks are
built by adding random values generating from normal distribution (mean
(μ = 0)) with different standard deviation (SD) values with the
original network. For a set, two different networks are cerated with a
fixed SD value. In a set, to create the difference between the two
generated coexpression networks, most of the cases we keep opposite
sign of random numbers across the two networks. Thus we have generated
5 different sets of these simulated networks by varying the SD value
from 0.1 to 0.5. Thus, the networks in a set, surely have strong
difference in their correlation patterns. As the SD increases, larger
random values are added to the correlation matrices, as a result the
difference of the two correlation matrices are subsequently increases.
So, for each SD value we have a pair of correlation matrices with
weights corresponding to the random values generated by the procedure
described above. Table [105]1 shows the average DC_Score at different
SD values of all modules extracted from the simulated data by using the
five different algorithms. It can be noticed from the table that the
average DC_Score of all modules corresponding to each network gradually
increase with higher value of SDs. This is obvious, as with the
increment of SD values, the weights of both the network which generally
represent correlation value is also increased, thus average DC_Score of
all predicted modules moderately increased. For better visualization we
have plot the distribution of DC_Score for each method at different SD
values. Figure [106]9(a) shows the distributions for SD value 0.5. The
distributions for other SD values are keep in Supplementary Figure.
Figure [107]9(b) also shown the DC_Score distributions for DiffCoMO at
different SD values. From the Fig. [108]9(b) it is clear that our
proposed method DiffCoMO behave correctly with respect to the simulated
data. Figure [109]9(a) shows DiffCOMO has the ability to detect
differential coexpression pattern better than the other methods in
simulated data.
Table 1.
Performance of DiffCoMO nad DiffCoEx with respect to simulated dataset:
average DC_Score of all modules identified from simulated
differentially coexpressed network with increasing SD value.
SD
0.1 0.2 0.3 0.4 0.5
DiffCoex 0.5296 0.6060 0.6960 0.7543 0.83503
CLICK 0.4385 0.4762 0.6608 0.7174 0.8086
CoXpress 0.4439 0.5855 0.6795 0.7294 0.8238
DICER 0.5682 0.6029 0.7149 0.7851 0.8868
DiffCoMO 0.6369 0.6517 0.7653 0.8333 0.9258
[110]Open in a new tab
Figure 9.
[111]Figure 9
[112]Open in a new tab
Panel (a) shows distributions of DC_Score of the identified modules for
different algorithms at SD value 0.5. Panel (b) shows the distribution
DC_Score of DiffCoMO modules at different SD values.
We have also investigate the performance of DiffCoex in the same
simulated data. The average DC_Score of all extracted modules are
listed in the Table [113]1.
Biological validation of modules
In this section we biologically validate the differential coexpresed
modules identified by the proposed DiffCoMO algorithm. For this purpose
we have performed a Gene Ontology and pathway based analysis of the
identified modules. Moreover, as coexpression results from coregulation
among the genes, so change in coexpression may effect the regulation
pattern also. So, we have performed an analysis to identify
transcrption factors (TF) that are targets of some miRNA families. As
the microRNA families are generaly known to be associated with some
specific disease so we have also performed an analysis to identify the
disease association with selected miRNA families associated with
different transcription factors which in tern involved in some of the
identified modules.
GO and Pathway enrichment
We have investigated what extent of Gene Ontology terms and pathways
are associated with the identified differential coexpression modules.
For this purpose we collected the GO terms from GO database and able to
associate these terms with identified modules. From KEGG database, we
also identified significant pathways that are involved in different
differential coexpression modules. Table [114]2 shows significant
GO-terms and pathways discovered from the identified modules. A careful
observation on Table [115]2 reveals that some of the identified GO
terms are common among the modules. This is because of the overlap in
the identified modules. To identify the overlaps we compute a overlap
score between each pair of identified modules. Overlap score between a
pair of modules is defined as the number of common human proteins
divided by the total number of unique human proteins in these modules.
In Supplementary Table [116]1 we have shown the overlap scores among
the identified differentially coexpressed modules.
Table 2.
GO-terms and KEGG pathway of some identified differentially coexpressed
modules.
module(Sl No.) #genes DC_Score GO term(bp) KEGG pathway
1 46 0.8216 positive regulation of molecular function (GO:0044093)
(1.7E-5) Alzheimer’s disease (1.0E-3)
2 93 0.8168 generation of precursor metabolites and energy
(GO:0006091)(1.3E-5) Alzheimer’s disease (1.1E-5)
3 102 0.8131 generation of precursor metabolites and energy
(GO:0006091)(9.1E-6) Parkinson’s disease (3.4E-3)
4 114 0.8070 positive regulation of ubiquitin-protein ligase activity
during mitotic cell cycle(GO:0051437)(8.1E-5) Alzheimer’s disease
(6.6E-6)
5 126 0.8093 lymphocyte differentiation (GO:0030098) (1.5E-2) Thyroid
cancer (1.6E-2)
6 156 0.7302 anaphase-promoting complex-dependent proteasomal
ubiquitin-dependent protein catabolic process(GO:0031145)(5.4E-8)
Parkinson’s disease(4.5E-6)
7 143 0.7260 DNA unwinding during replication(GO:0006268) (1.7E-4)
Intestinal immune network for IgA production (1.0E-3)
8 120 0.7100 mRNA metabolic process (GO:0016071) (2.2E-5) Systemic
lupus erythematosus (4.2E-2)
9 106 0.7003 anaphase-promoting complex-dependent proteasomal
ubiquitin-dependent protein catabolic process (GO:0031145) (4.7E-6)
Proteasome (1.4E-5)
[117]Open in a new tab
In Table [118]2 we show only the relevant GO-terms and pathways of
those modules which have DC_Score greater than 0.7. From Table [119]2
it is noticeable that the pathways of Alzheimer’s disease and
Parkinson’s disease are often associated with different differential
coexpression modules. In ref. [120]24 it is demonstrated that
HIV-infected peripheral blood mononuclear cells (PBMCs) show
overrepresentation of neurodegenerative pathways (Alzheimer’s,
Parkinson’s, ALS, Huntington’s and Prion Disease, etc). Here it is also
suggested that this overrepresentation together with genome wide
mapping of host gene expression may act as an indicator of possible
neurological deterioration in HIV patients.
The modules are also enriched with gene sets that are belonging to
different disease and disorder associated pathways like Systemic lupus
erythematosus (an autoimmune disease) and Intestinal immune network for
IgA production. Systemic lupus erythematosus (SLE) is a prototypic
autoimmune disease caused by the malfunctioning of immune system which
mistakenly attacks healthy tissue. Although SLE is rarely associated
with HIV infection^[121]25, but earlier in ref. [122]26 two cases are
reported where HIV infection causes SLE. Intestinal immune network for
IgA production which is responsible for defense against microorganism
by generating noninflammatory immunoglobulin A (IgA) antibodies, is
also demonstrated to be associated with HIV infection^[123]27. One
module is associate with the pathway Thyroid cancer. Abnormal thyroid
function test results are commonly reported in HIV infection
individuals^[124]28. In ref. [125]29 a case is reported where HIV
infection cause thyroid medullary carcinoma.
Moreover, the modules are enriched with different significant GO-terms.
For example module 1, 2, 8, and 9 are involved with similar type of GO
terms. The genes in those modules are primarily involved with the
activation or increase of some biological activity occurring at some
molecular level like increase of enzyme activity, increases the rate of
ubiquitin ligase activity or increases the rate of catalysis or
binding.
To know whether the identified modules are associated with HIV specific
important pathways we have performed an analysis. For this purpose, we
have collected 11 HIV specific top ranked pathways form Chen et al.
^[126]30. Next, we have calculated proportion of identified modules
that are associated with the pathways. We called a module is associated
with a pathway if at least one gene of this pathway is belonging to
this module. Figure [127]10 shows a bar diagram showing the association
of modules with the pathways. It can be seen from the figure that some
pathways like ‘Apoptosis’ and ‘T cell receptor signaling’ are
associated with more number of identified modules than the other.
Figure 10.
Figure 10
[128]Open in a new tab
Bar diagram showing the proportion of modules associated with 11 HIV
specific pathways.
Hence, it is evident that the identified differentially coexpressed
modules are significant and biologically meaningful.
To test whether the genes in identified modules are belonging to the
same functional group we collected the p-value of individual modules
identified in each of the methods. The p-value of a gene module signify
the probability of observing at least x number of genes out of total n
genes in the module annotated to a particular GO terms, given that the
proportion of genes in the whole genome are annotated with that GO
terms. The p-value is computed by comparing the GO terms shared by the
genes in the module to the background distribution of annotation. So a
p-value of a module closer to zero signifies that it is less likely to
observe the annotation of a particular GO term to a group of genes
occurs by chance. In Fig. [129]11 we plot a bar diagram which shows the
distribution of modules at different p-values. From Fig. [130]11 it is
depicted that large proportion of modules produced by other algorithms
have higher p-values in comparison with DiffCoMO in which a significant
number of modules tend to have smaller p-values (i.e. larger −log(p)).
This establishes that modules identified by DiffCoMO share common
biological functions and more biologically significant than the modules
discovered by other algorithms.
Figure 11.
Figure 11
[131]Open in a new tab
Bar diagram showing the distribution of modules at different p-values.
p-value of a module closer to zero signifies that it is less likely to
observe the annotation of a particular GO term to a group of genes
occurs by chance. X-axis represents −log(p) whereas Y-axis represents
proportion of modules identified by four algorithms. When p-value is
lower −log(p) approaches to higher value.
miRNA enrichment
In general, change in gene coexpression patterns has strong correlation
with coregulation of gene expression. So, differential coexpression of
a pair of gene may be an effect of changing regulation patterns of some
transcription factors. Moreover, miRNAs have an impact on HIV
replication by affecting the expression of host genes which are
essential in the replication process^[132]31. It may possible that TFs
bind some miRNAs which in tern affect the expression level of genes
that are required for viral replication. So, it is essential to
investigate the association between TF to miRNA, miRNA to TF regulation
and miRNA to gene regulation.
To investigate TF to miRNA connections we utilize a putative TF-miRNA
repository PuTmir^[133]32. PuTmir stores the information about the
regulatory activity of miRNAs in their upstream region (USR) as well as
downstream region (DSR). We choose to use PuTmir because it keeps the
information about TFs which binds to the DSR of miRNAs. Apart from
PuTmir we have also searched existing literatures^[134]33, [135]34 to
find more TFs that show altered expression in HIV infection. We found
total 12 TFs which are associated with the identified modules. We look
through the PuTmir database to identify the regulated miRNAs in which
the identified TFs bind. We collect the miRNA list for the
corresponding TFs which binds to the DSR and USR, separately. To get a
better visualization of the identified TFs and associated miRNAs we
create two networks between TFs and regulated miRNAs based on the two
binding regions (USR and DSR) of miRNAs. Figure [136]12 shows these two
bipartite network. From this figure we see that the identified TFs bind
126 miRNAs in the USR and 150 miRNAs in DSR. Here, the red diamond
shape nodes represent TFs and yellow color nodes represent the
associated miRNAs. The Venn diagram in Fig. [137]13 shows that 20
miRNAs are common between the two list of miRNAs. Thus, TFs binds in
both USR and in DSR of these 20 miRNAs. The associated miRNAs also
includes miR-29-b, miR-29-a, miR-28 and miR-125 which are demonstrated
to play an important role HIV replications^[138]31. It^[139]35 it is
established that human miR-29-a and miR-29-b are expressed in
Peripheral Blood Mononuclear Cells (PBMC). These miRNAs are responsible
for downregulating the expression of HIV-1 protein Nef^[140]35. In ref.
[141]36 it is also demonstrated that miR-28 and miR-125 which are found
in resting primary CD4+ T cells target 3′ ends of HIV-1 messenger RNAs,
thus inhibits HIV-1 production in resting primary CD4+ T cells.
Figure 12.
[142]Figure 12
[143]Open in a new tab
Tf-miRNA network. Here the red triangle shape nodes represent TFs
identified from the differential coexpressed modules. The pink circle
represent miRNAs regulated by those TFs in their upstream region (USR)
(shown in panel (a)) and in downstream region (DSR) (shown in panel
(b)).
Figure 13.
Figure 13
[144]Open in a new tab
Venn diagram showing the overlaps of two miRNA list.
For investigating the miRNA to TFs connections we have used a
experimentally validated miRNA-target interaction database
miRTarBase^[145]37. We have found 20 TFs in our identified modules,
which are regulated by miRNAs. Some non-TF genes in the modules are
also found to be regulated by the same set of miRNAs. We have listed
all the miRNA-TF and miRNA-non-TF interactions in Supplementary
Table [146]2 and Supplementary Table [147]3. It is noticed from the
tables that the identified modules are enriched with differentially
coexpressed genes that are targeted by some miRNAs like miRNA-28,
miRNA-29, miRNA-125 families. It is established that miR-29 families
are responsible for suppression of HIV infection^[148]31. It is also
verified that miRNA-28 and miRNA-125 families target the 3′ UTR of
HIV-1 transcripts that shifts HIV infection to latency stage^[149]38.
To investigate if the identified TFs have more miRNA connections than
expected by chance we performed a statistical test. For this, we have
collected 300 set of random modules retaining the size of each module
same as original. We count number of connections between miRNAs and
regulated TFs. As we are not aware of the distribution of these
connections so nonparametric test is the best option here. We have
utilized Wilcoxon Ranksum test here. The resulting p-value (8.6733e-10)
is significantly low which indicates that the identified TFs in
original module have more miRNA connections than expected by chance.
Performance of DiffCoMO in expression data with large number of samples
In this section, we have studied the performance of DiffCoMO in
microarray gene expression data having sufficiently large number of
samples. Here, we have downloaded the series [150]GSE18842 dataset from
GEO database ([151]http://www.ncbi.nlm.nih.gov/geo/) which consists 91
non-small cell lung cancer (NSCLC) samples, among them 46 samples are
tumors and 45 samples are controls. We have pre-processed the data by
following the same procedure mentioned in section. Finally, we have
selected 3527 expressed genes for our analysis. We have applied
DiffCoMO to detect differential coexpressed module in this data. We
calculate the DC_Score value of all identified modules and plot this in
Fig. [152]14. Figure shows most of the identified modules have DC_Score
higher than 0.5.
Figure 14.
Figure 14
[153]Open in a new tab
Distribution of DC_Score of all differentally coexpressed modules
identified by DiffCoMO in dataset [154]GSE18842.
Conclusion
In this study we have developed a multiobjective framework DiffCoMO to
identify differential coexpression modules from two microarray dataset
corresponding to two different phenotypes. DiffCoMO operates with two
objective functions. The first one maximizes the distance between two
correlation matrices constructed from two different infection stages.
Second objective function is built using eigengene based measure. It
maximizes the difference of module membership value of gene across two
infection stages. We compared the performance of DiffCoMO with four
other state-of-the-art algorithms: CoXpress, DiffCoEx, Click and DICER.
DiffCoMO performs better than other methods for capturing the
differential coexpression patterns.
We have also measured the performance of DiffCoMO algorithm with
respect to simulated dataset. The simulated study validate the
correctness of our algorithm to capture the differential coexpression
patterns. As random values of opposite sign (mean = 0, standard
deviation (SD) = 0.1 to 0.5) are added to the correlation values of two
matrices, the resulting differential coexpression matrix contains high
values. So, the mean DC_Score of identified modules in these simulated
data are increasing with SD values ranging from 0.1 to 0.5, as
expected. DiffCoMO shows a strong increasing pattern of mean DC_Score
with increasing SD values.
Differential correlation often caused by change in corregulation
patterns of a set of genes by a common regulator or TF. So, we
performed an analysis to find the corregulation patterns of identified
TFs from the extracted modules. The regulation patterns of miRNAs that
are regulated by those TFs are also investigated to see whether these
miRNAs are associated with some specific disease. All the identified
miRNAs are associated with different cancer associated disease. Thus we
can conclude that DiffCoMO can be used to pick out disease specific
miRNA families. The identified TFs also have significantly high miRNA
connection than expected by chance.
In most of the situation the acute phase of HIV-1 infection progresses
to the latent phase (chronic) accompanied with markedly diminishing
CD4+ cell count. This results constitutional symptoms of HIV in human
body that ultimately leads to AIDS. But A small proportion of HIV
infected individual remain clinically stable for a long periods and
have been referred as long-term nonprogressors. So, it is important to
know the extent of differential coexpression changes of modules across
acute to chronic stages along with acute to nonprogressor stage. As in
nonprogressor stage HIV infected individual remain clinically stable
for a long periods so it is important to know the change in regulation
pattern of TFs among the three stages of progression viz., acute,
chronic and non-progressor. We are now working in this direction.
Electronic supplementary material
[155]Supplementary_figure1^ (375.9KB, pdf)
[156]Supplementary_table1^ (345.7KB, pdf)
[157]Supplementary_table2^ (167KB, pdf)
[158]supplementary_table3^ (132.1KB, pdf)
Author Contributions
S.R. did the initial planning, implement the code, write manuscript.
U.M. provided constructive discussions, write and review the
manuscript. All authors reviewed the manuscript.
Competing Interests
The authors declare that they have no competing interests.
Footnotes
Sumanta Ray and Ujjwal Maulik contributed equally to this work.
Electronic supplementary material
Supplementary information accompanies this paper at
doi:10.1038/s41598-017-00090-2
Publisher's note: Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
References