Abstract
Well-characterized connections among diseases, long non-coding RNAs
(lncRNAs) and drugs are important for elucidating the key roles of
lncRNAs in biological mechanisms across various biological states. In this
study, we constructed a database called LNCmap (LncRNA Connectivity
Map), available at [46]http://www.bio-bigdata.com/LNCmap/, to establish
the correlations among diseases, physiological processes, and the
action of small molecule therapeutics by attempting to describe all
biological states in terms of lncRNA signatures. By reannotating the
microarray data from the Connectivity Map database, the LNCmap obtained
237 lncRNA signatures of 5916 instances corresponding to 1262 small
molecular drugs. We provided a user-friendly interface for the
convenient browsing, retrieval and download of the database, including
detailed information and the associations of drugs and corresponding
affected lncRNAs. Additionally, we developed two enrichment analysis
methods for users to identify candidate drugs for a particular disease
by inputting the corresponding lncRNA expression profiles or an
associated lncRNA list and then comparing them to the lncRNA signatures
in our database. Overall, LNCmap could significantly improve our
understanding of the biological roles of lncRNAs and provide a unique
resource to reveal the connections among drugs, lncRNAs and diseases.
Introduction
Long non-coding RNAs (lncRNAs) are transcripts that are longer than 200
nucleotides and are not translated into proteins. Recently, a large
number of lncRNAs have been identified, and increasing evidence shows
that lncRNAs play critical roles in various biological processes and
are engaged in multiple biological mechanisms^[47]1–[48]3, such as
physiological processes, chromatin modification,
transcriptional/post-transcriptional regulation and human
diseases^[49]4. Aberrant expression of lncRNAs is thought to play
critical roles in the development and progression of various cancer
types, and some of these lncRNAs could be further evaluated as
potential biomarkers. Furthermore, lncRNA expression can change in
response to treatment with bioactive small molecules. For example, the
expression of
lncRNA GAS5 was decreased in SKBR-3/Tr cells and breast cancer tissue
from trastuzumab-treated patients^[50]5, and Lavorgna et al. proposed
that lncRNAs may be a new class of therapeutic target, especially in
cancers^[51]6. Therefore, lncRNAs could be considered genomic
signatures for discovering the “connections” between drugs and
diseases.
Constructing a database to characterize and establish the connections
among diseases, lncRNAs and drugs is a meaningful endeavor. Previously,
RNA-seq was the only comprehensive way to profile lncRNA expression.
However, because of the high cost of this technique, publicly
available RNA-seq data sets for small-molecule treatments are
relatively limited compared to array-based expression
profiles. In contrast, the Connectivity Map has a large number of
array-based gene-expression profiles from cultured human cells that
have been treated with bioactive small molecules. Although lncRNAs are
not the intended targets of measurement in the original array design,
microarray probes can be reannotated for interrogating the lncRNA
expression^[52]1, [53]7, [54]8. By repurposing microarray data from the
Connectivity Map database for probing lncRNA expression, we constructed
a database called LNCmap to characterize lncRNA signatures of drugs,
and establish the correlations among diseases, lncRNAs, and the action
of small molecule therapeutics. In the LNCmap database, we repurposed a
total of 5916 Affymetrix microarray raw data instances and obtained 237
lncRNA signatures for up to 1262 small molecular drugs. The LNCmap
provided a user-friendly interface for the convenient browsing,
retrieval and download of the dataset. Additionally, we provided two
pattern-matching tools to establish the connections between diseases
and drugs in terms of lncRNAs.
Materials and Methods
Data sources
We downloaded raw data files from the Connectivity Map database
([55]http://www.broadinstitute.org/cmap/)^[56]9; the data were
generated on three platforms (HG-U133A, HT_HG-U133A, HT_HG-U133A_EA).
We obtained 5916 Affymetrix microarrays (.CEL files) corresponding to
1262 bioactive small molecules profiled on two Affymetrix
microarray platforms: Human Genome U133 Set (HG-U133A) and GeneChip HT
Human Genome U133 Array Plate Set (HT_HG-U133A), which contributed 674
and 5242 instances, respectively. Because sequence files for the
HT_HG-U133A_EA platform were unavailable, 184 instances from this
chipset were not used in our work. Drug information (such as ATC code)
was obtained from the
DRUGBANK database ([57]http://www.drugbank.ca/) and KEGG drug
([58]http://www.kegg.jp/kegg/drug/).
Repurposing microarray data for probing lncRNA expression
Following the ncFANs pipeline^[59]7, [60]10, we developed a similar
computational method to repurpose microarray data for probing lncRNA
expression. ncFANs, proposed by Liao et al., has been
widely used for the functional annotation of long non-coding RNAs in
various studies^[61]10–[62]13 and has become a popular method for
re-annotating microarray data to obtain high-throughput lncRNA
expression profiles. We first collected lncRNA transcript sequences
from GENCODE
(gencodeV19), and we used BLASTn to align the probe sequences provided
by Affymetrix ([63]http://www.affymetrix.com) to lncRNA transcript
sequences. Alignment results with an e-value greater than 10^−6 were
removed, and we filtered the remaining alignments as follows: (i) only
probes that matched a transcript perfectly over an alignment length of
25 nucleotides, with no mismatches, were retained; (ii) all probes that
targeted both lncRNA and protein-coding transcripts were removed; (iii)
all lncRNA transcripts corresponding to retained probes were mapped to
the genome and annotated at the gene level; and (iv) lncRNA genes
matched by fewer than three probes were discarded. After these
filtering steps, we used the R package affy to compute expression
values for all of the Cmap instance samples and obtained log2 fold
change values between the treatment samples and the corresponding
control samples. Finally, from these two platforms, we obtained
expression values for 237 lncRNAs that were affected by 1262 drugs.
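The four probe-filtering steps above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `filter_probes`, the tuple layout of `hits`, and the inputs `transcript_to_gene` and `coding_probes` are hypothetical, and the sketch assumes the BLASTn results have already been parsed into memory.

```python
from collections import defaultdict


def filter_probes(hits, transcript_to_gene, coding_probes,
                  probe_len=25, max_evalue=1e-6, min_probes=3):
    """Return {lncRNA gene: set of probe ids} after the filtering steps.

    hits: iterable of (probe_id, transcript_id, aln_length, mismatches, evalue)
          for alignments against lncRNA transcripts (hypothetical layout).
    transcript_to_gene: maps each lncRNA transcript id to its gene id.
    coding_probes: probe ids that also align to protein-coding transcripts.
    """
    gene_probes = defaultdict(set)
    for probe, transcript, length, mismatches, evalue in hits:
        if evalue > max_evalue:
            continue  # drop alignments with e-value above the cutoff
        if length != probe_len or mismatches != 0:
            continue  # step (i): full-length perfect matches only
        if probe in coding_probes:
            continue  # step (ii): probe also targets a protein-coding transcript
        # step (iii): collapse transcripts to the gene level
        gene_probes[transcript_to_gene[transcript]].add(probe)
    # step (iv): keep genes supported by at least `min_probes` probes
    return {g: p for g, p in gene_probes.items() if len(p) >= min_probes}
```

The sketch does not address probes that match several lncRNA genes, because the text does not say how such cases were handled.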
Enrichment analysis
Based on the correlations between drugs and lncRNAs in the LNCmap
database, users can identify candidate drugs for a particular disease
by inputting the corresponding lncRNA expression profiles or an
associated lncRNA list and then comparing them to drug-induced lncRNA
sets (mentioned in Database content). To do this, we provided two
analysis strategies, LncRNA Set Enrichment Analysis (LSEA) and
Over-Representation Analysis (ORA), to establish connections between
diseases and drugs in terms of lncRNAs.
LSEA
Although lncRNAs are thought to help elucidate the underlying
biological mechanisms in various biological states, such as disease or
chemical perturbation, the connections among diseases,
lncRNAs and drugs are not well characterized. Here, we introduce a
novel method, called lncRNA-set enrichment analysis (LSEA), to identify
a drug’s mode of action (MoA) based on lncRNA expression and
establish the correlations among lncRNAs, drugs and diseases.
The inputs of LSEA were the lncRNA expression profile and the label
file of a disease, in which samples should be classified into two
classes (such as normal and disease), labeled 0 or 1, respectively.
Following the pipeline of the Gene Set Enrichment Analysis
method^[64]14, in LSEA, we obtained a ranked list L of lncRNAs by
computing the lncRNA expression values, and we calculated an enrichment
score (ESi) for each drug-induced lncRNA set i as follows: walking
down the list L, we increased the running-sum statistic when we
encountered a lncRNA that was in drug-induced lncRNA set i and
decreased it when we encountered a lncRNA that was not in set i; ESi
was the maximum deviation from zero encountered in the random walk.
Given a query lncRNA expression profile, LSEA checked for each
drug-induced lncRNA set whether lncRNAs of this set tended to be
significantly ranked at the top (or bottom) of the list. This method
derived its power by focusing on lncRNA sets, which were likely to be
affected by the same drug. LSEA can be considered another type of GSEA:
in GSEA, each pathway is considered a set of genes; in LSEA, the lncRNA
is considered a “gene” and each drug-induced lncRNA set is considered
as a “pathway”. The output of LSEA was a ranked list of drug-induced
lncRNA sets represented by drug names.
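The running-sum statistic described above can be sketched in Python as follows. This is an illustrative, unweighted (Kolmogorov-Smirnov-like) variant in which every hit and every miss has an equal step size. Standard GSEA can additionally weight hits by their ranking metric, and the text does not state which weighting LSEA uses, so the step sizes here are an assumption. The function name `enrichment_score` is hypothetical.

```python
def enrichment_score(ranked_lncrnas, lncrna_set):
    """Unweighted running-sum enrichment score for one drug-induced set.

    ranked_lncrnas: list of lncRNA ids, already ordered by the ranking metric.
    lncrna_set: set of lncRNA ids in the drug-induced lncRNA set.
    Returns the signed maximum deviation of the running sum from zero.
    """
    hits = [lnc in lncrna_set for lnc in ranked_lncrnas]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # score is undefined without both hits and misses
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # step up on a set member, step down otherwise
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A positive score indicates that set members cluster near the top of the ranked list, and a negative score indicates that they cluster near the bottom.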
ORA
We developed another method to establish the connections between
diseases and drugs based on a list of lncRNAs, following the
classic over-representation analysis (ORA). This method assesses the
statistical over-representation of a user-defined, pre-selected
lncRNA list of interest in reference drug-induced lncRNA sets. The
input of ORA was a list of lncRNAs (e.g., differentially expressed
lncRNAs related to a specific disease), and the hypergeometric test was
used to calculate the statistical significance for each drug-induced
lncRNA set. The p-value evaluating the enrichment
significance for each lncRNA set was calculated as follows:
\[ p = 1 - \sum_{x=0}^{r-1} \frac{\binom{t}{x}\binom{m-t}{n-x}}{\binom{m}{n}} \]
Here, m is the total number of lncRNAs collected, of which t belong to
the drug-induced lncRNA set; the input lncRNA list contains n lncRNAs,
of which r belong to the drug-induced lncRNA set. After
calculating the p-values, we computed FDR-corrected q-values to
control the false discovery rate. The output of ORA was a
ranked list of drug-induced lncRNA sets represented by drug names.
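As a worked illustration of the hypergeometric p-value above, the following Python sketch evaluates the sum directly with exact integer binomial coefficients. The FDR step uses the Benjamini-Hochberg procedure, which is an assumption: the text says only that FDR-corrected q-values were used. The function names are hypothetical.

```python
from math import comb


def ora_pvalue(m, t, n, r):
    """P(X >= r) for a hypergeometric draw.

    m: total number of lncRNAs; t: lncRNAs in the drug-induced set;
    n: size of the input list; r: overlap between input list and set.
    """
    total = comb(m, n)
    below = sum(comb(t, x) * comb(m - t, n - x) for x in range(r))
    return 1 - below / total


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values (assumed FDR method), in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotone q-values
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues
```

For example, with m = 10, t = 3, n = 4 and r = 1, the sum contains only the x = 0 term, 35/210, so p = 1 − 1/6 = 5/6.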
Results
Database content
The LNCmap was designed to establish the connections among diseases,
lncRNAs and drugs. The flowchart of the LNCmap is shown in Fig. [65]1.
We first downloaded the raw data from the Connectivity Map database. By
reannotating the microarray data for lncRNAs, we obtained the lncRNA
expression profiles of samples treated with small molecular drugs.
Then, we matched the perturbation and control pairs of expression
profiles for each instance (experiment) according to the instance
description file “cmap_instances_02.xls” and calculated log2 fold change
values between the treatment samples and the corresponding control
samples for each instance. We provided a flexible threshold to define
differentially expressed lncRNAs (DELs), which can be considered
drug-affected lncRNAs. With fold change ≥2 (or fold change ≤1/2), we
obtained 173 lncRNAs that were affected by 1005 small molecular drugs,
corresponding to 2147 instances, and with fold change ≥1.5 (or fold
change ≤2/3), we obtained 237 lncRNAs and 5523 instances belonging to
1262 small molecular drugs. All of the drugs and affected lncRNAs were
stored in the LNCmap database according to the original instance ID.
Additionally, we collected the classification information from the
Anatomical Therapeutic Chemical (ATC) classification for these small
molecular drugs, and we provided integrated information, such as the
drug name, lncRNA Ensembl ID, log2 fold change values and instance ID.
The LNCmap provided a user-friendly interface to retrieve,
browse and download these data. Additionally, the
drug-affected lncRNAs were merged if the corresponding instances
belonged to the same drug (bioactive small molecule); these lncRNAs
were defined as drug-induced lncRNA sets, which were also stored in
the LNCmap database and used for LSEA and ORA enrichment analysis.
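The thresholding and merging steps above can be sketched in Python as follows. A fold change of ≥1.5 or ≤2/3 is equivalent to an absolute log2 fold change of at least log2(1.5) ≈ 0.585. The function name `drug_induced_sets` and the input layout (one pair of drug name and per-lncRNA log2 fold changes for each instance) are hypothetical and do not reflect the LNCmap schema.

```python
from collections import defaultdict
from math import log2


def drug_induced_sets(instances, fold_change=1.5):
    """Merge drug-affected lncRNAs across instances of the same drug.

    instances: iterable of (drug_name, {lncRNA id: log2 fold change}).
    fold_change: DEL threshold; 1.5 and 2 are the two values in the text.
    Returns {drug_name: set of affected lncRNA ids}.
    """
    cutoff = log2(fold_change)  # symmetric cutoff on the log2 scale
    sets = defaultdict(set)
    for drug, log2fc_by_lncrna in instances:
        for lncrna, value in log2fc_by_lncrna.items():
            if abs(value) >= cutoff:
                sets[drug].add(lncrna)  # up- or down-regulated past the cutoff
    return dict(sets)
```

Drugs for which no lncRNA passes the threshold do not appear in the result, which is consistent with the smaller drug count reported at the stricter fold change of 2.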
Figure 1. Schematic data flowchart of LNCmap.
Enrichment analysis
We developed two enrichment analysis algorithms (LSEA and ORA) to
establish the connections between diseases and drugs in terms of
lncRNAs.
Users could select either the LSEA or the ORA method; in both cases, the
results were provided as a ranked list of drugs with their drug-induced
lncRNA sets and could be downloaded from the result page. Top-ranked
drugs may help guide drug selection for a disease. We used primary
colorectal cancer data (SRP029880) as an example to perform LSEA and
ORA enrichment
analysis. With the ORA method, we input a list of differentially
expressed lncRNAs related to primary colorectal cancer and obtained a
table (Fig. [67]2a; Supplementary dataset [68]1) that included the drug
name (instance ID), ATC code (drug name), drug-induced lncRNAs,
overlapped lncRNAs, p-value and FDR q-value. Drug information can be
found at [69]https://www.ncbi.nlm.nih.gov/pccompound by clicking on
“DrugName” and lncRNA details can be found at
[70]http://asia.ensembl.org/ by clicking on the lncRNA hyperlink.
Several of the top-ranked drugs are compounds with known anti-cancer
activity.
For example, disulfiram^[71]15 and sirolimus^[72]16 are used for
colorectal cancer treatment, and fendiline^[73]17 is an anti-cancer
drug that is used to treat pancreatic cancer. In the LSEA method, we
input an expression profile and a label file of primary colorectal
cancer, and the LSEA result was displayed as a table (Fig. [74]2c;
Supplementary dataset [75]2) that contained the drug name (or instance
ID) ranked by p-value, overlapped lncRNAs, ATC code (or drug name), ES,
NES, normal p-value, FDR q-value, and FWER p-value. Drug information
can be found at [76]https://www.ncbi.nlm.nih.gov/pccompound by clicking
on “DrugName”, overlapped lncRNAs can be found by clicking on the
“number of overlapped lncRNAs” and an overview picture of compared
results can be displayed by clicking on the hyperlink “view”. Detailed
analysis results, including statistics, plots and report files of the
significantly enriched drugs, were also provided as a zip file for
users to download by clicking on the “Download Results” hyperlink on
the result page (see “Run Example” at
[77]http://www.bio-bigdata.com/LNCmap/lsea). Of the top-ranked drugs,
apigenin^[78]18 is used for colorectal cancer treatment, and
lycorine^[79]19 is an anti-cancer drug used for prostate cancer
treatment. In particular,
we discovered some meaningful drug-lncRNA-disease correlations (e.g.,
puromycin-NEAT1-colorectal cancer). Puromycin was one of the top-ranked
drugs in the LSEA results, and a literature search confirmed that
puromycin^[80]20 has been used for colorectal cancer treatment. We found
that the expression of lncRNA NEAT1 (ENSG00000245532) in the LNCmap
database changed significantly after treatment with puromycin. Meanwhile,
NEAT1 was related to tumor differentiation, invasion and metastasis in
colorectal cancer^[81]21. Furthermore, this puromycin-NEAT1-colorectal
cancer correlation was verified by a real-time PCR experiment (see
Supplementary Fig. [82]S1, Supplementary Information). Therefore, these
results not only provided insight into drug repositioning but also
helped explain the lncRNA signatures to discover the “connections”
between drugs and diseases.
Figure 2. Display of LNCmap website functions. (a) The ORA analysis
results. (b) The browsing of the LNCmap dataset. (c) The LSEA analysis
results. (d) The search results of LNCmap.
Additionally, we implemented KEGG ([84]http://www.genome.jp/kegg/)
pathway enrichment analysis for the primary colorectal cancer example
used in LSEA and ORA. Pathway enrichment analysis was based on the
co-expressed
protein-coding genes of drug-affected lncRNAs using SubpathwayMiner
tools (details of the enrichment procedure are provided in
Supplementary Information). In the enrichment results (Supplementary
dataset [85]3), many significantly enriched pathways were dysregulated
in colorectal cancer cells, which further confirmed the LSEA and ORA
results. For example, the MAPK signaling pathway (path:04010) regulated
intrinsic resistance to bromodomain and extra-terminal domain
family protein inhibitors in colorectal cancer^[86]22. Ye et al. found
that downregulated lncRNA CLMAT3 promotes the proliferation of
colorectal cancer cells by targeting regulators of the cell cycle
pathway (path:04110)^[87]23, and low proteasome (path:03050) activity
was related to treatment resistance in colorectal cancer^[88]24.
Database architecture and web interface
LNCmap was implemented using the JavaEE framework and deployed on a
Tomcat 6.0 web server. All database content was stored in a MySQL 5
relational database management system; the server side was
implemented in Java 1.7, and the web pages were written in
JSP. The LSEA algorithm was implemented using the core code of GSEA in
the R language. Because of its considerable running time, we chose an
asynchronous approach: the analysis is packaged as a background
job, and the server responds immediately with the job ID. Users can use
the linked uniform resource locator (URL) that contains the ID to monitor
the job’s completion, and the URL will navigate to the result page when
the job is complete. LNCmap allows users to access all of the key
features of the web application through their mobile device. Here, we
provided an intuitive and user-friendly interface to browse and search
the database. The LNCmap browser was developed to view the
drug-affected lncRNAs (Fig. [89]2b), their expression in log2 fold
change values and other instance information simultaneously, and the
details of lncRNAs can be viewed by clicking on the lncRNA hyperlink.
The LNCmap search toolkit offers various methods for querying the
database. Users can retrieve drug-affected lncRNA records by
querying the database with a lncRNA, a drug, or both.
The search result is displayed by default as an overview
table that summarizes the drug-affected lncRNAs and the corresponding
instance information (Fig. [90]2d). Details of lncRNA and drug
information are available through the links in the table. Expression
changes are reported as log2 fold change values with a threshold of ±0.58
(i.e., fold change ≥1.5 or fold change ≤2/3). The complete query result
data can be downloaded to local computers from the download links in
the lower panel. In addition, LNCmap provided the ability to download
all of the data, such as lncRNAs (Supplementary dataset [91]4), drugs,
relationships between drug and affected lncRNAs, drug-induced lncRNA
sets (Supplementary dataset [92]5) used for enrichment analysis, from
the Download Page.
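The job-handling pattern described in this section (submit the analysis as a background job, return a job ID immediately, then poll for completion) can be illustrated with a minimal Python sketch. LNCmap itself is implemented in Java and JSP, so this is not the authors' code. The class `JobQueue` and its methods are hypothetical, and a real service would expose `status` through a URL containing the job ID.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobQueue:
    """Run long analyses in the background and track them by job id."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}

    def submit(self, func, *args):
        """Start func(*args) in the background and return a job id at once."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def status(self, job_id):
        """Report whether the job has finished; suitable for polling."""
        return "done" if self._jobs[job_id].done() else "running"

    def result(self, job_id):
        """Block until the job finishes and return its result."""
        return self._jobs[job_id].result()
```

A client would call `submit`, keep the returned ID, and poll `status` until it reports "done" before fetching the result.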
Discussion
In this study, we constructed a database called LNCmap that established
the correlations among diseases, small molecules, and lncRNA
signatures. We first applied a computational method to repurpose
microarray data collected from Cmap for probing lncRNA expression and
identified drug-affected lncRNAs with differentially expressed values
of fold change ≥2 (≤1/2) or fold change ≥1.5 (≤2/3) according to
instance. Then, we merged drug-affected lncRNAs if the corresponding
instances belonged to the same drug and defined them as drug-induced lncRNA
sets. These drug-induced lncRNA sets were then used for enrichment
analysis to identify the drugs that may affect the corresponding
disease. We also integrated information of instances and the ATC
classification of drugs in the database and provided a user-friendly
interface to freely retrieve, browse and download this information.
Our study characterized the connections of diseases, lncRNAs and drugs
for the first time. To do this, we also developed two enrichment
analysis algorithms (ORA and LSEA). ORA is a classic gene set
enrichment analysis method. Here, we used the ORA to assess the
statistical overrepresentation of a user-defined, pre-selected lncRNA
list of interest in a reference list of known drug-induced lncRNA sets
using the hypergeometric test. In contrast to ORA, LSEA incorporates
expression level measurements and provides different analysis results.
The enrichment analysis results showed candidate drugs for a particular
disease. If users are interested in particular drugs and lncRNAs, they
can further verify the results experimentally (e.g., by quantitative
real-time PCR). Users can select either method to analyze the lncRNAs
of interest according to their needs.
We also noticed that there were some limitations of our current study.
Compared to the tens of thousands of lncRNAs that have been found, we
obtained only 237 drug-affected lncRNAs, and the number of lncRNAs in
our database is thus limited. This is because the lncRNA expression was
probed from traditional HG-U133A and HT_HG-U133A Affymetrix microarray
platforms, from which only hundreds of lncRNAs could be reannotated.
Although next-generation sequencing could identify many more lncRNAs,
publicly available RNA-seq data sets for small-molecule treatments
are relatively limited. With the development of pharmacogenomics,
sequencing data on drug-induced lncRNA expression are increasing, which
will increase the number of drug-induced lncRNAs and yield more accurate
correlations between small molecules and lncRNAs. Therefore, our study
may be greatly improved with the development of pharmacogenomics
sequencing.
Electronic supplementary material
[93]Supplementary Information^ (281.9KB, pdf)
[94]SI dataset 1^ (29.5KB, xls)
[95]SI dataset 2^ (57.5KB, xls)
[96]SI dataset 3^ (88.5KB, xls)
[97]SI dataset 4^ (32.5KB, xls)
[98]SI dataset 5^ (558.5KB, xls)
Acknowledgements