Abstract
With next-generation sequencing, the genomic data available for the
characterization of integration sites (IS) has dramatically increased.
At present, in a single experiment, several thousand viral integration
genome targets can be investigated to define genomic hot spots. In a
previous article, we recast the formal analysis of common integration
sites (CIS), until then based on a rigid fixed-window demarcation, into
a more flexible definition grounded on graphs. Here, we present a
selection of supporting data related to the graph-based framework (GBF)
from our previous article, in which a collection of CIS was identified
on six published datasets. In this work, we focus on two datasets,
IS[RTCGD] and IS[HIV], which have been previously discussed. Moreover,
we show in more detail the workflow that produces the datasets.
__________________________________________________________________
Specifications Table
Subject area | Computational biology, systems biology
More specific subject area | Gene therapy, integrational mutagenesis analysis
Type of data | Table, image, dataset
How data was acquired | In silico experiments
Data format | Analyzed datasets, analyzed Excel tables, PNG files
Experimental factors | Integration sites datasets were analyzed with a new computational method for common integration sites identification
Experimental features | A proposed set of common integration sites from two published integration sites datasets (see [1]). A pathway enrichment analysis is also reported
Data source location | Heidelberg, Germany
Data accessibility | Data is with this article and in ref. [1]
Value of the data
* The analyzed dataset provided here can be used as a benchmark to
  compare the results of the graph modeling approach for CIS
  identification and analysis implemented in software tools.
* Graph modeling approach to the identification of common integration
  sites.
* Validation of the graph-based framework (GBF) against well-known
  datasets.
* Detailed illustrated procedure for the identification of CIS via
  GBF.
1. Data
The dataset containing the CIS identified from the Retroviral Tagged
Cancer Gene Database (RTCGD) [6] is provided in Table 1 Appendix A. It
was obtained with a Cytoscape 2.8 plugin that implements some of the
features of the GBF method (see [1] for how to retrieve the code). The
other datasets were collected using a standard web browser. Fig. 1
shows a Venn diagram comparing two datasets. The first is the
collection of all the genes found with the GBF method; the second is
the list of genes provided by RTCGD, which uses the standard window
method (SWM) to identify CIS and the next gene approach (NGA) to
associate an annotated gene with each identified CIS. For further
details about the two approaches, see [1]. The GBF method discovers
1421 genes that are not present in the RTCGD dataset, only 142 genes in
the RTCGD gene list are not discovered by the GBF method, and 404 genes
are found by both methods.
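The three regions of such a comparison follow directly from set operations. The sketch below is only illustrative: the function name `venn_counts` and the toy gene symbols are assumptions and are not taken from the datasets. The last two lines derive the list sizes implied by the counts reported above (1825 genes for GBF and 546 for RTCGD).

```python
def venn_counts(gbf_genes, swm_genes):
    """Return the sizes of the three Venn regions: (only GBF, both, only SWM)."""
    gbf, swm = set(gbf_genes), set(swm_genes)
    return len(gbf - swm), len(gbf & swm), len(swm - gbf)


# Toy gene symbols, purely illustrative (not the content of the datasets).
gbf_example = {"Myc", "Gfi1", "Pim1", "Ccnd2"}
swm_example = {"Myc", "Pim1", "Evi1"}
only_gbf, both, only_swm = venn_counts(gbf_example, swm_example)

# List sizes implied by the counts reported in the text.
gbf_total = 1421 + 404    # all genes found with the GBF method
rtcgd_total = 142 + 404   # all genes in the RTCGD (SWM) list
```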
Fig. 1.
Venn diagram of the gene atmosphere of all identified CIS from the
RTCGD dataset using the GBF (graph-based framework) [1] and using the
SWM (standard window method) [2].
2. Experimental design, materials and methods
2.1. Experiment workflow
The workflow of the analysis is depicted in Fig. 2. The input is a
dataset composed of a list of integration sites (IS). The graph-based
framework (GBF) presented in [1] is adopted to perform all the
following analyses. The first step is the CIS identification and the
computation of some statistics for every CIS. The further steps are
optional, but when performed they must follow the order given here. The
second step enhances the CIS dataset with information from annotated
genomic data and generates the gene atmosphere (GA) dataset shown in
Table 2 Appendix A. The third step is the functional analysis of the GA
dataset, shown in Table 3 Appendix A.
Fig. 2.
Workflow of the full analysis process: starting from the raw dataset to
the functional analysis.
2.2. Data preparation
The dataset used for the analysis must contain a few attributes in
order to be properly analyzed by the GBF method. Some of these
attributes are mandatory; they are shown in Table 1. The mandatory
attributes for the CIS enhancing phase are shown in Table 2.
Table 1.
Mandatory attributes of the input dataset for the identification of CIS
using the GBF method.
Attribute | Description
Chromosome number | The ordinal number of the chromosome in which the integration event was found
Insertion site position | The position on the genome: a very long integer number representing the base pair where the virus was integrated
Entropy label (e.g. kind of tumor, virus type) | Meta-information used for the computation of the CIS entropy. It is a label that represents a factor of the experiment, for example the tumor model or type with which the IS is associated
Table 2.
Mandatory attributes of the input dataset for the enhancing analysis
with the GBF method using annotated genomic data.
Attribute | Description
Chromosome number | The ordinal number of the chromosome in which the TSS of the gene is located
Transcription start site | The position on the genome: a very long integer number representing the base pair where transcription starts at the 5′-end of a gene sequence
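The mandatory attributes of Tables 1 and 2 can be represented as two small record types. This is a minimal sketch under stated assumptions: the class names, the field names, and the optional `gene` field are hypothetical, and the chromosome is stored as a string so that non-numeric chromosomes such as X could also be represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrationSite:
    """One row of the IS dataset (mandatory attributes of Table 1)."""
    chromosome: str   # chromosome identifier (assumption: kept as text)
    position: int     # base pair where the virus was integrated
    label: str        # entropy label, e.g. tumor type or virus type


@dataclass(frozen=True)
class TranscriptionStartSite:
    """One row of the annotation dataset (mandatory attributes of Table 2)."""
    chromosome: str   # chromosome on which the TSS is located
    position: int     # base pair where transcription starts
    gene: str = ""    # hypothetical extra field, not mandatory in Table 2


def sort_by_position(records):
    """Order records by chromosome and then by genomic position."""
    return sorted(records, key=lambda r: (r.chromosome, r.position))
```

Sorting the records once by position is the preparation step that the identification procedure relies on for efficiency.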
2.3. Common integration sites identification
The method presented in [1] allows the identification of CIS on the
basis of very few attributes found in the dataset under analysis (see
Table 1). Fig. 3 shows the flowchart of the global method that builds
the model and identifies the CIS with their statistics.
Fig. 3.
Flowchart of the main method for the identification and enhancing of
CIS using the graph-based framework.
Starting from the dataset containing the integration sites (IS
dataset), it is convenient to sort the dataset by integration position
to improve the efficiency of the algorithm. This is the data
preparation part (Table 1). Afterwards, as depicted in Fig. 3, the
building of the model starts by creating an empty graph. For every IS
in the dataset, a node is created and added to the graph. A nested loop
then checks, for each vertex already in the graph, whether its distance
from the newly added IS node is below a certain threshold. If so, an
edge connecting the two nodes of the same type (i.e. two IS nodes) is
created and added to the graph. When all the IS from the dataset have
been processed, the main loop terminates and the graph is ready to be
analyzed by the main algorithm for CIS identification. This algorithm
can be implemented in different ways (e.g. as an algorithm that
extracts the connected components (CC) from an undirected, disconnected
graph). An efficient version of this algorithm is presented in [3].
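The procedure described above can be sketched in a few lines. This is an illustrative sketch, not the plugin code and not the efficient algorithm of [3]: the function names, the `min_order` parameter, and the threshold value used in the example are assumptions. Because the sites are sorted, the inner loop can stop at the first node that is too far away or lies on another chromosome.

```python
def build_is_graph(sites, threshold):
    """Build an undirected IS graph from (chromosome, position) pairs.

    Two nodes are connected when they lie on the same chromosome and their
    distance is lower than the threshold.
    """
    nodes = sorted(sites)
    adjacency = {i: set() for i in range(len(nodes))}
    for i, (chrom_i, pos_i) in enumerate(nodes):
        j = i - 1
        # Walk back over the nodes already in the graph; sorting lets us
        # stop at the first node that cannot be connected.
        while j >= 0:
            chrom_j, pos_j = nodes[j]
            if chrom_j != chrom_i or pos_i - pos_j >= threshold:
                break
            adjacency[i].add(j)
            adjacency[j].add(i)
            j -= 1
    return nodes, adjacency


def connected_components(adjacency):
    """Extract the connected components with an iterative depth-first search."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


def identify_cis(sites, threshold, min_order=2):
    """Return each candidate CIS as a list of sites (min_order is hypothetical)."""
    nodes, adjacency = build_is_graph(sites, threshold)
    return [[nodes[i] for i in comp]
            for comp in connected_components(adjacency)
            if len(comp) >= min_order]


example = [("1", 100), ("1", 150), ("1", 5000), ("2", 120), ("1", 180)]
cis_list = identify_cis(example, threshold=100)
```

In this toy input the three sites at positions 100, 150 and 180 on chromosome 1 form a single component, while the two remaining sites stay isolated and are filtered out by `min_order`.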
2.4. Common integration sites statistics computation
Once the CIS have been identified, a set of statistics is computed. The
most interesting statistics are presented in Table 3. For further
details about how the statistics are computed, see Paragraph 2.6 in
[1].
Table 3.
Computed statistics for CIS.
Statistic | Description
CIS order | The total number of IS present in the CIS
CIS dimension | The number of base pairs that contain all the IS belonging to a single CIS (see Section 2.7 for details)
CIS p-value | The p-value associated with the CIS. See Paragraph 3.6 in [1] for a comprehensive explanation
CIS entropy | The entropy of the CIS based on the label from the input dataset (e.g. tumor type, virus type). See Paragraph 3.6 in [1] and Section 2.7
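The first two statistics of Table 3 can be illustrated with a short sketch. The function names are hypothetical, and the dimension is computed here as the inclusive span between the first and the last IS, which is an assumption about how the number of base pairs is counted.

```python
def cis_order(positions):
    """Total number of IS present in the CIS."""
    return len(positions)


def cis_dimension(positions):
    """Base pairs spanned by the CIS (assumption: inclusive of both ends)."""
    return max(positions) - min(positions) + 1


example_positions = [100, 150, 180]   # toy positions, not from the datasets
order = cis_order(example_positions)
dimension = cis_dimension(example_positions)
```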
2.5. Common integration sites enhancing
Optionally, an enhancing of the CIS dataset can follow. Its purpose is
to link each IS with its neighborhood on the genome by retrieving
annotations available in online databases. Here, we used a standard web
browser to query the annotated data provided online by the BioMart
database [4]. The dataset resulting from this step is shown in Table 2
Appendix A, which lists the transcriptional elements (TE) composing the
GA of all CIS identified in the previous step. As shown in the
flowchart in Fig. 3, the process that builds the GA is similar to the
process that builds the IS graph: an IS node is linked with a TE node
if their distance on the genome is below a certain threshold.
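The linking rule for the gene atmosphere can be sketched with a binary search over sorted TSS positions. This is a minimal sketch for one CIS on one chromosome: the function name, the gene names, and the threshold value are hypothetical, and the annotation data are assumed to have already been downloaded into a dictionary.

```python
from bisect import bisect_left, bisect_right


def gene_atmosphere(cis_positions, tss_by_gene, threshold):
    """Return the genes whose TSS is closer than threshold to any IS of the CIS.

    cis_positions: IS positions of one CIS (single chromosome).
    tss_by_gene:   mapping gene name -> TSS position on the same chromosome.
    """
    ordered = sorted((pos, gene) for gene, pos in tss_by_gene.items())
    tss_positions = [pos for pos, _ in ordered]
    linked = set()
    for is_pos in cis_positions:
        # Keep only TSS with |tss - is_pos| < threshold (strict inequality).
        lo = bisect_right(tss_positions, is_pos - threshold)
        hi = bisect_left(tss_positions, is_pos + threshold)
        linked.update(gene for _, gene in ordered[lo:hi])
    return linked


# Hypothetical annotation data for illustration only.
tss_example = {"GeneA": 900, "GeneB": 1700, "GeneC": 5000, "GeneD": 500, "GeneE": 1600}
atmosphere = gene_atmosphere([1000, 1200], tss_example, threshold=500)
```

A TSS lying exactly at the threshold distance is not linked, which mirrors the "below a certain threshold" wording.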
2.6. Functional annotation using a GA list
If the previous step is performed, a functional annotation using DAVID
[5] may follow. This is the last step of the main workflow shown in
Fig. 2. Here, we performed this step on the RTCGD dataset; the output
is shown in Table 3 Appendix A.
2.7. CIS properties computed in the Cytoscape prototype
CIS number
Integer value given to a CIS by the plugin.
CIS name
Name of the CIS as it appears in the tabular exported file. It is a
composition of the chromosome and the CIS number.
CIS order
Number of IS that compose the CIS.
CIS average position
Approximate CIS position p[A] calculated as
\[ p_A = \frac{IS_{first} + IS_{last}}{2} \]
where IS[first] and IS[last] are the positions on the chromosome of the
first and last IS in the CIS.
CIS median position
Approximate CIS position p[M] calculated by sorting the n IS as they
appear on the chromosome:
* (1) \[ p_M = IS_{\left(\frac{n+1}{2}\right)} \] if n is odd, or
* (2) \[ p_M = \frac{IS_{\left(\frac{n}{2}\right)} + IS_{\left(\frac{n}{2}+1\right)}}{2} \] if n is even.
IS[(i)] is the position of the ith IS of the CIS. For CIS with an
asymmetric distribution of the IS, this approximation gives a more
precise estimation.
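The two position estimates can be compared on a small example. The sketch below follows the definitions of p[A] and p[M] given above; the function names and the toy positions are assumptions.

```python
def cis_average_position(positions):
    """p[A]: midpoint between the first and the last IS of the CIS."""
    ordered = sorted(positions)
    return (ordered[0] + ordered[-1]) / 2


def cis_median_position(positions):
    """p[M]: median of the IS positions sorted along the chromosome."""
    ordered = sorted(positions)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # IS[((n+1)/2)], 1-based
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of IS[(n/2)] and IS[(n/2+1)]


# Asymmetric toy CIS: four IS close together and one far away.
skewed = [900, 100, 120, 110, 130]
p_a = cis_average_position(skewed)
p_m = cis_median_position(skewed)
```

For this skewed input p[A] is 500.0 while p[M] is 120, so the median stays within the dense group of IS.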
CIS entropy
If the number of different labels (entropy label) found in the CIS is n
and the order is O, the entropy value is computed as
\[ E_{CIS} = \sum_{i=1}^{n} \frac{n_i}{O} \, \frac{\log \frac{n_i}{O}}{\log n} \]
where n[i] is the number of IS labelled with i.
Normalized entropy
If the number of different labels (entropy label) found in the entire
dataset is N and the order of the CIS is O, the entropy value is
computed as
\[ NE_{CIS} = \sum_{i=1}^{N} \frac{n_i}{O} \, \frac{\log \frac{n_i}{O}}{\log N} \]
where n[i] is the number of IS labelled with the label i.
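Both entropy values can be computed with one function that differs only in the logarithm used for scaling. This is a sketch under explicit assumptions: the function name and the `n_labels` parameter are hypothetical, the leading minus sign of Shannon entropy is applied so that the result lies between 0 and 1, labels that do not occur in the CIS contribute zero, and a CIS or dataset with a single label is given entropy 0 to avoid dividing by log 1 = 0.

```python
import math
from collections import Counter


def cis_entropy(labels, n_labels=None):
    """Entropy of the labels of one CIS, scaled by log(n) or log(N).

    labels:   entropy label of each IS in the CIS (its length is the order O).
    n_labels: None to scale by the number of labels found in the CIS (n),
              or the number of labels in the entire dataset (N) to obtain
              the normalized entropy.
    """
    labels = list(labels)
    counts = Counter(labels)
    order = len(labels)
    base = len(counts) if n_labels is None else n_labels
    if base < 2:
        return 0.0  # assumption: a single label carries no entropy
    # Assumption: Shannon sign convention (leading minus sign).
    total = sum((c / order) * math.log(c / order) for c in counts.values())
    return -total / math.log(base)


toy_labels = ["lymphoma", "lymphoma", "leukemia", "leukemia"]  # illustrative
e_cis = cis_entropy(toy_labels)               # scaled by the n = 2 labels in the CIS
ne_cis = cis_entropy(toy_labels, n_labels=4)  # scaled by N = 4 labels in the dataset
```

With two equally frequent labels the CIS entropy is close to 1, and it is halved when scaled by a dataset that contains four labels.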
CIS p value
See the subsection “Statistical model, p-value and log-likelihood ratio
test” in [1]
CIS loglike ratio
See the subsection “Statistical model, p-value and log-likelihood ratio
test” in [1]
Footnotes
Appendix A
Supplementary data to this article can be found online at
http://dx.doi.org/10.1016/j.csbj.2015.11.004.
Appendix A. Supplementary data
Table 1: Identified CIS and their statistics using the GBF method on
the RTCGD dataset (only CIS of order ≥ 4).
Table 2: Gene atmosphere of the identified CIS using the GBF method on
the RTCGD dataset. For each gene the transcription start site (TSS) is
reported.
Table 3: Functional analysis using DAVID. Two gene lists were analyzed:
the genes found both with GBF and SWM on the RTCGD dataset, and the
genes found only with GBF.
Table 4: CIS found on the HIV dataset (see [1]; only CIS of order
≥ 10).
mmc1.zip (466.6 KB, zip)
References