   Keywords: Long non‐coding RNA, Functional prediction,
   Disease-associated SNPs, Coefficient of variation, WGCNA

Abstract

   Long non‐coding RNAs (lncRNAs) play critical roles in various
   biological processes and are associated with many diseases. Functional
   annotation of lncRNAs in diseases has attracted great attention as a
   route to understanding disease etiology. However, traditional
   co-expression-based analysis usually produces a large number of
   false-positive function assignments. It is thus crucial to develop a
   new approach that achieves a lower false discovery rate in the
   functional annotation of lncRNAs. Here, a novel strategy termed DAnet,
   which combines disease associations with a cis-regulatory network
   between lncRNAs and neighboring protein-coding genes, was developed,
   and its performance was systematically compared with that of the
   traditional differential expression-based approach. In a gold-standard
   analysis of experimentally validated lncRNAs, the proposed strategy
   identified experimentally validated lncRNAs better than the
   traditional approach. Moreover, 40% to 100% of the biological pathways
   identified by DAnet in each dataset were reported to be associated
   with the studied diseases. In sum, DAnet is expected to be useful for
   identifying the function of specific lncRNAs in a particular disease
   or in multiple diseases.

1. Introduction

   Long non‐coding RNA (lncRNA) is broadly defined as non-coding RNA
   longer than 200 nucleotides [45][1]. Substantial evidence shows that
   lncRNAs carry out diverse functions in biological processes [46][2]
   and are associated with many diseases [47][3], such as cancers
   [48][4], cardiovascular diseases [49][5], neurodegenerative diseases
   [50][6], metabolic diseases [51][7], and inflammatory diseases
   [52][8]. Many computational methods for predicting lncRNA function
   have been developed [53][9], for instance, differential expression
   analysis (DEA) combined with weighted correlation network analysis
   (WGCNA) [54][10]. This method has been frequently employed, for
   example to identify co-regulatory relationships among lncRNAs and
   mRNAs in polycystic ovary syndrome [55][11] and to discover the
   cis-regulatory lncRNAs involved in vascular inflammation [56][12].

   However, analysis based on co-expression usually results in a large
   number of false-positive function assignments [57][9]. Moreover,
   experimentally supported lncRNA-disease association data remain quite
   limited in the literature [58][13]. Specifically, only about 6,000 of
   over 90,000 lncRNAs in the human genome have been experimentally
   characterized as “disease-associated” [59][14], [60][15]. This may be
   attributed to the complex characteristics of lncRNAs, including their
   higher expression variability across disease conditions [61][16],
   [62][17], [63][18], the susceptibility of their expression and
   secondary structure to genetic variants [64][19], [65][20], [66][21],
   and the various levels at which they regulate coding genes (cis/trans)
   [67][2], [68][18].

   Incorporating disease specificity into lncRNA functional annotation
   can improve the discovery of disease-associated lncRNAs [69][16]. In
   particular, lncRNA-disease associations can be established via single
   nucleotide polymorphisms (SNPs), a type of genetic variant, located
   within lncRNAs [70][16] and via condition-specific expression analysis
   based on the coefficient of variation (CV) [71][17], [72][22].
   Moreover, many lncRNAs have been reported to regulate the expression
   of their neighboring genes (that is, to act in cis) [73][23],
   [74][24], [75][25]. The co-expression of cis-regulatory lncRNAs and
   their neighboring protein-coding genes has led to the discovery of
   functional lncRNAs in given diseases [76][26]. It is therefore crucial
   to develop a new approach that integrates disease associations to
   obtain a lower false discovery rate (FDR) [77][16].

   In this study, a novel strategy termed DAnet, which combines disease
   associations with a cis-regulatory network, was developed. First,
   disease-associated SNPs were integrated to screen for
   disease-associated lncRNAs. The CV of these lncRNAs was then estimated
   to assess their condition-specific expression in a specific disease.
   Next, a WGCNA-based co-expression network between the lncRNAs and
   their neighboring protein-coding genes was constructed, and Kyoto
   Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis
   was conducted to identify the function of the lncRNAs involved.
   Finally, experimentally verified lncRNA-disease associations were
   curated to evaluate the performance of the proposed strategy across
   24 datasets involving eight types of disease according to the ICD-11
   classification. Overall, the findings of this study can facilitate the
   discovery of disease-associated lncRNAs and their function in specific
   diseases.

2. Methods

2.1. Collection of the benchmark datasets for the analysis

   To analyze lncRNA function in different types of disease, a variety
   of microarray/RNA-seq data were collected by searching disease names
   in Gene Expression Omnibus (GEO) [78][27] and The Cancer Genome Atlas
   (TCGA) [79][28]. Datasets were selected according to six criteria:
   (1) the gene expression profiling was conducted using high-throughput
   sequencing or lncRNA microarray for “Homo sapiens”, (2) the dataset
   consisted of patient and control groups, (3) the raw or normalized
   data were available, (4) the number of lncRNAs identified by
   disease-associated SNPs was greater than zero, (5) experimentally
   validated disease-associated lncRNAs, obtained from five public
   databases (LncRNAWiki [80][29], LncRNADisease [81][14], LncRNA2Target
   [82][30], Lnc2Cancer [83][31], and EVLncRNAs [84][32]), were available
   for the disease, and (6) the collection covered multiple types of
   disease according to the ICD-11 classification. In total, 22 benchmark
   datasets were collected from GEO and two from TCGA, covering 16
   diseases divided into eight disease types according to the ICD-11
   classification. The lncRNA and mRNA expression matrices obtained from
   these 24 case-control datasets were used for subsequent analysis.
   [85]Table 1 lists the disease type (ICD-11 code), dataset ID, number
   of samples, expression unit, and number of lncRNAs and mRNAs for each
   dataset.

Table 1.

   Twenty-four datasets of eight disease types were collected for function
   analysis of lncRNA. The first 22 datasets were collected from GEO and
   the last two datasets were collected from TCGA. MDD: major depressive
   disorder; VHD: valvular heart disease; AF-VHD: valvular heart disease
   with atrial fibrillation; SLE: systemic lupus erythematosus; ALL: acute
   lymphoblastic leukemia; TPM: Transcripts Per Million; Normalized: DESeq
   normalized; nRPKM: normalized Reads Per Kilobase of transcript, per
   Million mapped reads; FPKM: Fragments Per Kilobase of exon per Million;
   RPKM: Reads Per Kilobase of transcript per Million reads mapped;
   Normalized signal intensity: Quantile normalization using the
   GeneSpring software.
   Type of Disease | Dataset ID | No. of Samples in the specific dataset | Expression Unit (Experiment type) | No. of lncRNAs & mRNAs
   8A20 | [86]GSE113524 [87][72] | 19 Alzheimer disease; 20 Healthy controls | TPM (RNA-Seq) | 12,937 lncRNAs & 18,969 mRNAs
   8A20 | [88]GSE104704 [89][73] | 12 Alzheimer disease; 10 Healthy controls | Normalized (RNA-Seq) | 2,199 lncRNAs & 17,965 mRNAs
   8A20 | [90]GSE125583 [91][74] | 219 Alzheimer disease; 70 Healthy controls | nRPKM (RNA-Seq) | 2,803 lncRNAs & 18,852 mRNAs
   6A70 | [92]GSE101521 [93][75] | 30 MDD; 29 Healthy controls | Normalized (RNA-Seq) | 11,109 lncRNAs & 18,754 mRNAs
   6A70 | [94]GSE102556 [95][76] | 26 MDD; 22 Healthy controls | FPKM (RNA-Seq) | 12,718 lncRNAs & 18,793 mRNAs
   6A20 | [96]GSE112523 [97][77] | 29 Schizophrenia; 28 Healthy controls | Reads Count (RNA-Seq) | 12,179 lncRNAs & 18,437 mRNAs
   BA41 | [98]GSE65705 [99][78] | 32 Myocardial infarction; 2 Healthy controls | RPKM (RNA-Seq) | 1,351 lncRNAs & 17,801 mRNAs
   BA41 | [100]GSE127853 [101][79] | 3 Myocardial infarction; 3 Healthy controls | FPKM (RNA-Seq) | 503 lncRNAs & 10,216 mRNAs
   BD40 | [102]GSE97210 [103][80] | 3 Atherosclerosis; 3 Healthy controls | Normalized signal intensity (Microarray) | 10,347 lncRNAs & 18,604 mRNAs
   BD40 | [104]GSE120521 [105][81] | 4 Atherosclerosis unstable; 4 Atherosclerosis stable | FPKM (RNA-Seq) | 10,343 lncRNAs & 18,381 mRNAs
   BC81 | [106]GSE113013 [107][27] | 5 AF-VHD; 5 VHD | Normalized signal intensity (Microarray) | 10,347 lncRNAs & 18,604 mRNAs
   BC81 | [108]GSE108660 [109][27] | 5 Atrial fibrillation; 5 Non-atrial fibrillation | Normalized signal intensity (Microarray) | 8,090 lncRNAs & 18,807 mRNAs
   CA23 | [110]GSE106388 [111][82] | 15 Mild asthma; 4 Healthy controls | Reads Count (RNA-Seq) | 8,036 lncRNAs & 17,244 mRNAs
   CA23 | [112]GSE96783 [113][83] | 21 Asthma; 30 Healthy controls | Reads Count (RNA-Seq) | 10,451 lncRNAs & 18,324 mRNAs
   DD71 | [114]GSE128682 [115][84] | 14 Ulcerative colitis; 16 Healthy controls | Reads Count (RNA-Seq) | 1,756 lncRNAs & 17,355 mRNAs
   4A40 | [116]GSE131525 [117][85] | 3 SLE; 3 Healthy controls | Reads Count (RNA-Seq) | 6,031 lncRNAs & 16,972 mRNAs
   5A10 | [118]GSE131526 [119][85] | 12 Type-1 diabetes; 3 Healthy controls | Reads Count (RNA-Seq) | 6,798 lncRNAs & 16,458 mRNAs
   5B81 | [120]GSE129398 [121][86] | 12 Obesity; 10 Controls | Reads Count (RNA-Seq) | 822 lncRNAs & 14,300 mRNAs
   5B81 | [122]GSE145412 [123][87] | 8 Obesity; 8 Controls | TPM (RNA-Seq) | 6,896 lncRNAs & 16,595 mRNAs
   5A11 | [124]GSE133099 [125][27] | 6 Type-2 diabetes; 6 Lean controls | Reads Count (RNA-Seq) | 8,843 lncRNAs & 17,480 mRNAs
   2B33 | [126]GSE141140 [127][88] | 13 ALL; 4 Healthy controls | Reads Count (RNA-Seq) | 867 lncRNAs & 16,297 mRNAs
   2B91 | [128]GSE144259 [129][89] | 6 Colorectal cancer; 3 Healthy controls | FPKM (RNA-Seq) | 3,249 lncRNAs & 18,604 mRNAs
   2C6Z | TCGA-BC [130][28] | 115 Breast cancer; 113 Healthy controls | FPKM (RNA-Seq) | 14,097 lncRNAs & 19,631 mRNAs
   2D10 | TCGA_TC [131][28] | 510 Thyroid cancer; 58 Healthy controls | Reads Count (RNA-Seq) | 13,618 lncRNAs & 19,493 mRNAs

2.2. Collection of the SNP-disease association data for the identification of
potential disease-associated lncRNAs

   The SNP-disease association data were collected and used to identify
   potential disease-associated lncRNAs. First, we collected the SNPs
   associated with the 16 diseases, together with their locations, from
   three well-known sources: GRASP2 [133][33], NHGRI-EBI GWAS Catalog
   [134][34], and GWASdb [135][35]. A significance level of p less than
   5.0 × 10^-8 is widely accepted in genome-wide association studies
   [136][34]. However, since many susceptibility loci may show only
   moderate significance in association analysis, a p value of less than
   1.0 × 10^-3 was applied for collecting the disease-associated SNPs
   [137][35]. Then, we downloaded the chromosomal location information of
   lncRNAs from GENCODE (v31, human reference genome hg38) [138][36] to
   map the disease-associated SNPs to lncRNA regions. In total, we
   collected 124,428 associations between 101,360 SNPs and the 16
   diseases for further analyses, and 4,435 unique lncRNAs were found to
   be potentially associated with these diseases. Details on the number
   of disease-associated SNPs and lncRNAs are shown in Supplementary
   Table S1. Finally, we extracted the expression levels of these lncRNAs
   in each dataset from the raw lncRNA expression matrix; the number of
   lncRNAs extracted on the basis of disease-associated SNPs for each
   dataset is listed in [139]Table 2.
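
   The mapping step described above can be illustrated with a minimal
   sketch. It assumes that each SNP is available as a (ID, chromosome,
   position, p value) record and each lncRNA as a (ID, chromosome, start,
   end) record; the function name, the record layout, and the toy
   coordinates are hypothetical and are not taken from the DAnet code or
   from GENCODE. Only the p value cutoff of 1.0 × 10^-3 comes from the
   text.

```python
from collections import defaultdict

P_CUTOFF = 1.0e-3  # relaxed significance threshold stated in the text


def map_snps_to_lncrnas(snps, lncrnas, p_cutoff=P_CUTOFF):
    """Return {lncRNA id: [SNP ids]} for SNPs with p < p_cutoff that fall
    inside the lncRNA interval (same chromosome, start <= pos <= end)."""
    # Group lncRNA intervals by chromosome so each SNP is only compared
    # with intervals on its own chromosome.
    by_chrom = defaultdict(list)
    for lnc_id, chrom, start, end in lncrnas:
        by_chrom[chrom].append((start, end, lnc_id))

    hits = defaultdict(list)
    for snp_id, chrom, pos, p_value in snps:
        if p_value >= p_cutoff:
            continue  # SNP does not pass the association threshold
        for start, end, lnc_id in by_chrom.get(chrom, []):
            if start <= pos <= end:
                hits[lnc_id].append(snp_id)
    return dict(hits)


# Toy, made-up records (hypothetical; not real SNPs or lncRNAs).
toy_lncrnas = [
    ("lncA", "chr1", 1000, 5000),
    ("lncB", "chr1", 8000, 9000),
    ("lncC", "chr2", 100, 900),
]
toy_snps = [
    ("rs_toy1", "chr1", 1500, 2e-9),   # inside lncA, passes cutoff
    ("rs_toy2", "chr1", 8500, 5e-4),   # inside lncB, passes cutoff
    ("rs_toy3", "chr1", 8600, 0.02),   # inside lncB, fails cutoff
    ("rs_toy4", "chr2", 950, 1e-6),    # passes cutoff, outside lncC
]
mapped = map_snps_to_lncrnas(toy_snps, toy_lncrnas)
print(mapped)
```

   The linear scan over intervals is chosen for readability; for
   genome-scale inputs an interval index would likely be preferable.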

Table 2.

   Optimization of K[CV] and CD across the datasets. When N[exp] was at
   its maximum, the lowest K[CV]/CD reaching it was taken as the optimal
   value. N[exp]: the number of experimentally verified lncRNAs; K[CV]:
   the number of top-ranked lncRNAs with the highest variability; NA: not
   available.
   Disease Name | Dataset ID | No. of lncRNA in the specific dataset | No. of lncRNA based on disease-associated SNP | No. of experimental verified lncRNA | K[CV] cutoff | CD cutoff
   Alzheimer disease | [140]GSE113524 | 12,937 | 1,680 | 5 | 400 | 400 kb
   Alzheimer disease | [141]GSE104704 | 2,199 | 407 | 5 | 200 | 5 kb
   Alzheimer disease | [142]GSE125583 | 2,803 | 537 | 5 | 400 | 50 kb
   Major depressive disorder | [143]GSE101521 | 11,109 | 1,043 | 2 | 600 | 5 kb
   Major depressive disorder | [144]GSE102556 | 12,718 | 1,098 | 2 | 1,000 | 5 kb
   Schizophrenia | [145]GSE112523 | 12,179 | 917 | 3 | 300 | 5 kb
   Myocardial infarction | [146]GSE65705 | 1,351 | 35 | 2 | 35 | 100 kb
   Myocardial infarction | [147]GSE127853 | 503 | 16 | 2 | 16 | NA
   Atherosclerosis | [148]GSE97210 | 10,347 | 163 | 1 | 100 | NA
   Atherosclerosis | [149]GSE120521 | 10,343 | 120 | 1 | 100 | 5 kb
   Atrial fibrillation | [150]GSE113013 | 10,347 | 38 | 1 | 38 | NA
   Atrial fibrillation | [151]GSE108660 | 8,090 | 33 | 1 | 33 | NA
   Asthma | [152]GSE106388 | 8,036 | 291 | 2 | 200 | 5 kb
   Asthma | [153]GSE96783 | 10,451 | 352 | 2 | 100 | 5 kb
   Lupus erythematosus | [154]GSE131525 | 6,031 | 64 | 1 | 64 | 5 kb
   Ulcerative colitis | [155]GSE128682 | 1,756 | 20 | 1 | 20 | 70 kb
   Type-1 diabetes mellitus | [156]GSE131526 | 6,798 | 283 | 3 | 200 | 5 kb
   Obesity | [157]GSE129398 | 822 | 46 | 1 | 46 | 5 kb
   Obesity | [158]GSE145412 | 6,896 | 197 | 1 | 100 | 5 kb
   Type-2 diabetes mellitus | [159]GSE133099 | 8,843 | 1,075 | 5 | 600 | 5 kb
   Acute lymphoblastic leukemia | [160]GSE141140 | 867 | 12 | 1 | 12 | NA
   Colorectal cancer | [161]GSE144259 | 3,249 | 43 | 6 | 43 | 300 kb
   Breast cancer | TCGA_BC | 14,097 | 528 | 12 | 500 | 5 kb
   Thyroid cancer | TCGA_TC | 13,618 | 8 | 1 | 8 | NA

2.3. Detection of the expression variability of lncRNA by condition-specific
expression

   LncRNAs show higher expression variability in disease than in normal
   conditions. A lncRNA with relatively high expression variability may
   have a disease-related function, whereas one with relatively low
   variability may function in the normal condition [163][16],
   [164][22]. The CV is the standard measure of expression variability
   [165][16], [166][22] and is defined as “the ratio between the standard
   deviation of the lncRNA expression levels across the patients and its
   mean” [167][22]. In this study, we used this measure to assess the
   variability of the potential disease-associated lncRNAs. The CV was
   calculated for each lncRNA across the disease samples, and lncRNAs
   with relatively high CV were taken to be disease-associated.
   Specifically, we ranked the lncRNAs by CV from high to low and
   identified the K[CV] top-ranked lncRNAs as the disease-associated
   ones. The value of K[CV] was optimized for each dataset as follows.
   For each candidate K[CV], the number of experimentally validated
   lncRNAs among the top K[CV] lncRNAs (N[exp]) was computed. When the
   number of lncRNAs identified by SNPs (N[snp]) was less than 100, K[CV]
   was set equal to N[snp]; otherwise, K[CV] ranged from 100 to N[snp] in
   steps of 100. The lowest K[CV] at which N[exp] reached its maximum was
   taken as the optimal value.
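
   The ranking and optimization procedure can be sketched as follows.
   This is an illustration under stated assumptions, not the authors'
   implementation: the function names and toy expression values are
   hypothetical, the sample standard deviation is assumed (the text does
   not say whether the sample or population form was used), and the
   candidate grid is assumed to contain only multiples of the step that
   do not exceed N[snp].

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean of one lncRNA across disease samples.
    Assumption: sample standard deviation; a zero mean yields CV = 0."""
    m = mean(values)
    return stdev(values) / m if m != 0 else 0.0


def candidate_ks(n_snp, step=100):
    """Candidate K[CV] values: N_snp itself if N_snp < step,
    otherwise step, 2*step, ... up to N_snp (assumed grid)."""
    if n_snp < step:
        return [n_snp]
    return list(range(step, n_snp + 1, step))


def optimal_k_cv(expr, validated, step=100):
    """expr: {lncRNA id: expression values in disease samples}.
    validated: set of experimentally validated lncRNA ids.
    Returns (optimal K, N_exp at that K, lncRNAs ranked by CV)."""
    ranked = sorted(expr, key=lambda g: coefficient_of_variation(expr[g]),
                    reverse=True)
    best_k, best_n_exp = None, -1
    for k in candidate_ks(len(ranked), step):
        n_exp = len(validated.intersection(ranked[:k]))
        if n_exp > best_n_exp:  # strict '>' keeps the lowest K among ties
            best_k, best_n_exp = k, n_exp
    return best_k, best_n_exp, ranked


# Toy, made-up expression values (hypothetical lncRNA ids).
toy_expr = {
    "lnc1": [1.0, 1.0, 1.0, 1.0],
    "lnc2": [1.0, 3.0, 1.0, 3.0],
    "lnc3": [0.0, 0.0, 0.0, 8.0],
    "lnc4": [5.0, 6.0, 5.0, 6.0],
    "lnc5": [1.0, 2.0, 3.0, 2.0],
}
# A small step is used only so that the toy data exercise the grid.
best_k, best_n_exp, ranked = optimal_k_cv(toy_expr, {"lnc3", "lnc5"}, step=2)
print(best_k, best_n_exp, ranked)
```

   With the default step of 100, the grid reproduces the pattern seen in
   Table 2, where datasets with fewer than 100 SNP-based lncRNAs have
   K[CV] equal to N[snp].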

2.4. Construction of the cis-regulatory network based on lncRNAs’ neighboring
genes

   Co-expressed genes are more likely to be co-regulated and functionally
   associated, so identifying co-expressed neighboring protein-coding
   genes can help in assigning lncRNA function [168][16], [169][37],
   [170][38]. First, we collected the location information of all 16,840
   lncRNAs and 19,975 protein-coding genes from GENCODE (v31, human
   reference genome hg38) [171][36]. We then obtained 10 candidate
   chromosome distances (CDs) from publications on the genomic distance
   between lncRNAs and the neighboring genes they regulate: 5 kb
   [172][39], 10 kb [173][40], 20 kb [174][41], 50 kb [175][42], 70 kb
   [176][43], 100 kb [177][44], 200 kb [178][45], 300 kb [179][46], 400
   kb [180][47], and 500 kb [181][12]. Second, we identified the
   neighboring genes within each CD upstream or downstream of every
   lncRNA based on the collected location information. This yielded the
   set of neighboring genes for the disease-associated lncRNAs identified
   by SNPs and the optimal K[CV]. Third, we constructed the co-expression
   network between the identified disease-associated lncRNAs and their
   neighboring genes at each CD for each dataset using WGCNA [182][10].
   An optimization procedure was then performed to determine the optimal
   CD for each benchmark dataset: the number of experimentally validated
   lncRNAs among the lncRNAs co-expressed with neighboring genes (N[exp])
   was computed for each CD, and the lowest CD at which N[exp] reached
   its maximum was regarded as optimal. Finally, for the functional
   prediction, the co-expression network based on the optimal K[CV] and
   CD was constructed by WGCNA for each dataset. The network of the
   selected WGCNA module was visualized with Cytoscape 3.7.2
   ([183]http://www.cytoscape.org/) [184][48].
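
   The neighbor search and the CD selection rule can be sketched as
   below. The sketch makes two assumptions that the text does not state:
   distance is measured as the gap between the lncRNA and gene intervals
   (zero when they overlap), and a CD is treated as not available when no
   candidate CD recovers any validated lncRNA. The WGCNA step itself is
   not reproduced; the N[exp] counts per CD are taken as given. All
   names and coordinates are hypothetical.

```python
# Candidate chromosome distances (kb) listed in the text.
CANDIDATE_CDS_KB = [5, 10, 20, 50, 70, 100, 200, 300, 400, 500]


def neighbors_within(lncrna, genes, cd_kb):
    """Protein-coding genes on the same chromosome within cd_kb kilobases
    upstream or downstream of the lncRNA.
    lncrna: (id, chrom, start, end); genes: list of the same layout."""
    _, chrom, l_start, l_end = lncrna
    window = cd_kb * 1000
    found = []
    for gene_id, g_chrom, g_start, g_end in genes:
        if g_chrom != chrom:
            continue
        # Gap between the two intervals; 0 when they overlap (assumption).
        gap = max(g_start - l_end, l_start - g_end, 0)
        if gap <= window:
            found.append(gene_id)
    return found


def optimal_cd(n_exp_by_cd):
    """Lowest CD at which N_exp is maximal. Returns None when no CD
    recovers a validated lncRNA (assumed meaning of 'not available')."""
    if not n_exp_by_cd or max(n_exp_by_cd.values()) == 0:
        return None
    best = max(n_exp_by_cd.values())
    return min(cd for cd, n_exp in n_exp_by_cd.items() if n_exp == best)


# Toy, made-up coordinates (hypothetical ids).
toy_lnc = ("lncA", "chr1", 100_000, 110_000)
toy_genes = [
    ("g1", "chr1", 112_000, 120_000),   # 2 kb downstream
    ("g2", "chr1", 40_000, 60_000),     # 40 kb upstream
    ("g3", "chr1", 700_000, 710_000),   # 590 kb downstream
    ("g4", "chr2", 100_000, 105_000),   # different chromosome
]
for cd in (5, 50, 500):
    print(cd, neighbors_within(toy_lnc, toy_genes, cd))
```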

2.5. Annotating the lncRNA function based on KEGG pathway

   Groups of transcripts identified through clustering need to be
   subjected to functional enrichment analysis to reveal the biological
   processes in which these genes are involved [185][16]. KEGG pathways
   [186][49] are widely used for characterizing the function of
   disease-associated lncRNAs. Herein, we performed KEGG enrichment
   analysis using the mRNAs found to be co-expressed with
   disease-associated lncRNAs. The statistical significance of KEGG
   pathway enrichment was determined with the hypergeometric test, and a
   p value less than 0.05 indicated significant enrichment. In addition,
   a chord diagram was constructed using the R package “circlize”
   [187][50] to illustrate the enrichment results.
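
   For illustration, the hypergeometric enrichment p value for a single
   pathway can be computed as in the sketch below. It uses the standard
   upper-tail form of the test; the function name, the argument names,
   and the toy counts are hypothetical, and the background set actually
   used in the analysis is not specified in the text.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric p value P(X >= n_overlap), where
    n_selected genes are drawn without replacement from n_background
    genes, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy counts (hypothetical): 20 background genes, 5 in the pathway,
# 4 co-expressed mRNAs, 3 of which are in the pathway.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)
print(p_value, p_value < 0.05)
```

   Here the tail sums the cases of three and four overlapping genes,
   (150 + 5) / 4845, so the toy pathway would pass the 0.05 threshold
   used in this study.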

2.6. Evaluating the ability of DAnet on the function annotation of lncRNA

   As a gold standard for verifying the DAnet analysis, 9,949 pairs of
   experimentally verified lncRNA-disease associations were integrated
   from five databases: LncRNAWiki [188][29], LncRNADisease [189][14],
   LncRNA2Target [190][30], Lnc2Cancer [191][31], and EVLncRNAs
   [192][32], which provide many experimentally verified lncRNAs for
   diseases. Two metrics, both based on experimentally validated
   disease-associated lncRNAs, were employed to evaluate the ability of
   DAnet to characterize the function of disease-associated lncRNAs:
   (1) the percentage of successful prediction (Rate) and (2) the
   enrichment factor (EF). The Rate (%) of DAnet and DEA (Supplementary
   Method S1) in characterizing the experimentally verified lncRNAs was
   the first metric. The EF compares the concentration of experimentally
   verified lncRNAs among the lncRNAs identified by DAnet/DEA with their
   concentration in the entire lncRNA expression matrix. False discovery
   can be effectively evaluated by fully considering the experimentally
   validated disease-associated lncRNAs [193][51]. The EF is defined as:
   \[ EF = \frac{N_{\mathrm{truesuc}} / N_{\mathrm{suc}}}{N_{\mathrm{true}} / N_{\mathrm{all}}} \]

   where N[truesuc] denotes the number of experimentally verified lncRNAs
   successfully characterized as ‘disease-associated’ by DAnet or DEA;
   N[suc] is the number of lncRNAs characterized as ‘disease-associated’
   by DAnet or DEA; N[true] is the number of experimentally verified
   lncRNAs in the integrated lncRNA-disease associations; and N[all] is
   the total number of lncRNAs in the expression matrix. An EF of no less
   than 1 indicates enrichment, and a larger EF corresponds to a lower
   FDR [194][51].
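
   As a worked example of the EF formula, the sketch below evaluates it
   for the atrial fibrillation dataset GSE113013. N[all] = 10,347 and
   N[suc] = 38 are read from Table 2 (taking the lncRNAs identified by
   SNPs as N[suc]), and N[truesuc] = 1 is the number of experimentally
   verified lncRNAs listed there. N[true] = 1 is an assumption inferred
   from the Rate of 100% reported for this dataset; it is not stated
   explicitly in the text. The function name is hypothetical.

```python
def enrichment_factor(n_truesuc, n_suc, n_true, n_all):
    """EF = (N_truesuc / N_suc) / (N_true / N_all)."""
    return (n_truesuc / n_suc) / (n_true / n_all)


# GSE113013: inputs from Table 2; n_true = 1 is an assumption (see lead-in).
ef_gse113013 = enrichment_factor(n_truesuc=1, n_suc=38, n_true=1, n_all=10_347)
print(round(ef_gse113013, 1))
```

   Under these assumptions the result is about 272.3, which agrees with
   the EF reported for GSE113013 in Section 3.1.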

3. Results

3.1. Identification of disease-specific lncRNA by SNPs across the benchmark
datasets

   More than 90% of disease-associated SNPs are located in non-coding
   regions (e.g., lncRNAs). SNPs located in lncRNAs can either modify
   their secondary structure or affect their expression level [195][20].
   As described in the Methods section, potential disease-associated
   lncRNAs in the 24 benchmark datasets were identified by
   disease-associated SNPs for the DAnet analysis, whereas differentially
   expressed lncRNAs were regarded as disease-associated for DEA
   (Supplementary Method S1). The Rate was then used to measure the
   performance of DAnet and DEA in identifying experimentally verified
   lncRNAs. As shown in [196]Supplementary Fig. S1, the Rate obtained for
   DEA with the adjusted p value (from 0% for 18 datasets to 16.7% for
   [197]GSE125583) was lower than that obtained with the raw p value
   (from 0% for 11 datasets to 33.3% for [198]GSE106388). Among the 24
   datasets, eight had no differentially expressed genes at an FDR below
   0.05. Thus, the raw p value (p less than 0.05) was used to identify
   differentially expressed lncRNAs across the 24 datasets.

   As shown in [199]Fig. 1, the Rate of DAnet varied widely (from 2.6%
   for TCGA-TC to 100% for [200]GSE113013 and [201]GSE108660), as did the
   Rate of DEA (from 0% for 11 datasets to 33.3% for [202]GSE106388). The
   Rate of DAnet was generally no lower than that of DEA across the 24
   benchmark datasets. Moreover, among the 24 benchmark datasets, the two
   atherosclerosis datasets [203]GSE97210 and [204]GSE120521 were
   generated by microarray and RNA-Seq, respectively. We therefore
   compared the microarray and RNA-Seq data in terms of the originally
   detected lncRNAs, the potential disease-associated lncRNAs, and the
   experimentally validated lncRNAs. As shown in [205]Supplementary
   Fig. S2, the total number of originally detected lncRNAs for
   [206]GSE97210 and [207]GSE120521 was 10,347 and 10,343, respectively,
   of which 6,836 were detected in both (highlighted in blue and red
   lines). The number of potential disease-associated lncRNAs was 163 for
   [210]GSE97210 and 120 for [211]GSE120521, of which 111 were shared
   (highlighted in green and red lines). In both datasets, the
   experimentally validated lncRNA (CDKN2B-AS1) was identified by DAnet.
   These findings indicate that the two datasets are consistent in
   identifying the experimentally validated lncRNA.

Fig. 1.

   Performance comparison between DAnet and DEA across the 24 benchmark
   datasets (shown in [218]Table 1) based on the percentage of successful
   prediction (Rate, %). The Rate measures how well the experimentally
   verified disease-associated lncRNAs were characterized.

   Similarly, the EF was employed to assess the ability of DAnet and DEA
   to control false characterization. As shown in [219]Fig. 2, the EF of
   DAnet varied widely (from 2.2 for [220]GSE125583 to 272.3 for
   [221]GSE113013), as did the EF of DEA (from 0.0 for 11 datasets to 9.2
   for [222]GSE106388). The EF of DAnet was generally no lower than that
   of DEA for each dataset, and all EFs of DAnet were greater than one.

Fig. 2.

   Performance comparison between DAnet and DEA across the 24 benchmark
   datasets (shown in [225]Table 1) based on the enrichment factor (EF).
   The EF compares the concentration of experimentally verified lncRNAs
   among the lncRNAs identified by DAnet/DEA with their concentration in
   the entire lncRNA expression matrix.

3.2. Optimizing the K[CV] and CD parameters across the benchmark datasets

   To identify the more likely disease-associated lncRNAs, an
   optimization procedure was performed to determine the optimal K[CV]
   and CD for each benchmark dataset. As shown in [226]Fig. 3, the
   optimal K[CV] (red square) varied across the datasets (from 8 for
   TCGA-TC to 1000 for [227]GSE102556), and the experimentally verified
   disease-associated lncRNAs generally had higher CV values. [228]Table
   2 lists the optimal K[CV] for each dataset. Likewise, as shown in
   [229]Supplementary Fig. S3, the optimal CD (red square) differed
   across the datasets (from 5 kb for 13 datasets to 400 kb for
   [230]GSE113524); the values are also listed in [231]Table 2. For six
   datasets ([232]GSE127853, [233]GSE97210, [234]GSE113013,
   [235]GSE108660, [236]GSE141140, TCGA_TC), the CD was not available.

Fig. 3.

   Optimization of K[CV] across the benchmark datasets. X axis: the
   number of top-ranked lncRNAs with the highest variability (K[CV]); Y
   axis: the number of experimentally verified lncRNAs (N[exp]). When the
   number of lncRNAs identified by SNPs (N[snp]) was less than 100, K[CV]
   was set equal to N[snp]; otherwise, K[CV] ranged from 100 to N[snp] in
   steps of 100.

3.3. The function of lncRNA in disease characterized by DAnet

3.3.1. KEGG enrichment analysis to characterize lncRNA function

   The co-expression network of lncRNAs and neighboring mRNAs was
   constructed by WGCNA under the optimal K[CV] and CD for each dataset.
   The network of the module containing the most genes with significant
   correlation was displayed with Cytoscape [239][48]. Four networks are
   shown in [240]Fig. 4 A-D as examples, in which the light-yellow square
   represents the lncRNA, the blue dot represents the co-expressed mRNA
   in the cis-lncRNA regulatory network, and the red edge represents the
   association between a disease-associated lncRNA and a neighboring
   mRNA. The other 14 networks are shown in [241]Supplementary Fig. S4.
   For each dataset, KEGG enrichment analysis was performed to
   characterize lncRNA function via the co-expressed mRNAs. A chord
   diagram was drawn to illustrate the significantly enriched pathways
   across the datasets ([242]Fig. 4 E), in which enriched pathways
   reported to be associated with the studied disease are indicated by
   blue lines and other pathways by grey lines. The proportion of
   disease-related pathways in each dataset is shown in [244]Fig. 4 F; it
   ranged from 40% to 100% across datasets. Detailed descriptions of the
   relevance between diseases and pathways are provided in Supplementary
   Table S2.

Fig. 4.

   The function of lncRNA in disease characterized by DAnet. A-D:
   co-expression network of the module containing the most genes with
   significant correlation, constructed by WGCNA for each dataset. A:
   [247]GSE113524, B: [248]GSE65705, C: [249]GSE131525, D: [250]GSE131526,
   green square: lncRNA, blue dot: mRNA. E: chord diagram of the enriched
   pathways of 15 benchmark datasets (p less than 0.05). F: percentage of
   disease-associated pathways in each dataset.

3.3.2. Association between lncRNAs identified by DAnet and the specific
disease

   Finally, the reported relationships between the identified lncRNAs and
   the diseases were searched systematically and manually. As illustrated
   in [251]Fig. 5, 41 directly disease-associated lncRNAs were identified
   for most diseases (blue lines). In particular, 13 lncRNAs were
   identified for Alzheimer disease (orange square, 8A20), three for
   major depressive disorder (brown square, 6A70), four for schizophrenia
   (brown square, 6A20), 12 for myocardial infarction (blue square,
   BA41), two for atherosclerosis (blue square, BD40), six for asthma
   (pink square, CA23), one for lupus erythematosus (purple square,
   4A40), one for ulcerative colitis (turquoise square, DD71), five for
   obesity (yellow square, 5B81), six for type-2 diabetes mellitus
   (yellow square, 5A11), three for colorectal cancer (green square,
   2B91), and six for breast cancer (green square, 2C6Z). Detailed
   descriptions of the relevance between the lncRNAs and the specific
   diseases are provided in Supplementary Table S3.

Fig. 5.

   Associations between lncRNAs identified by DAnet and the specific
   disease. The blue lines represent the reported associations between
   lncRNAs and diseases. The squares represent the types of disease. The
   dots indicate lncRNAs identified by DAnet. Orange square: diseases of
   the nervous system; brown square: mental, behavioural or
   neurodevelopmental disorders; blue square: diseases of the circulatory
   system; pink square: diseases of the respiratory system; purple
   square: diseases of the immune system; turquoise square: diseases of
   the digestive system; yellow square: endocrine, nutritional or
   metabolic diseases; green square: neoplasms; grey dot: lncRNA not
   reported in the studied disease; green dot: lncRNA associated with a
   single disease; red dot: lncRNA associated with multiple diseases.

   Meanwhile, as illustrated in [254]Fig. 5, lncRNAs associated with
   multiple diseases (red dots) were also identified. Specifically, two
   lncRNAs (LINC-PINT, GAS5) were associated with both Alzheimer disease
   and type-2 diabetes mellitus [255][52], [256][53], [257][54],
   [258][55], [259][56]; SOX2-OT was associated with Alzheimer disease
   and asthma [260][57], [261][58]; CCDC39 was associated with asthma and
   schizophrenia [262][59], [263][60]; HCP5 was associated with asthma
   and breast cancer [264][61], [265][62]; IFNG-AS1 was associated with
   asthma and ulcerative colitis [266][63], [267][64]; and CDKN2B-AS1 was
   associated with five diseases, namely Alzheimer disease, myocardial
   infarction, atherosclerosis, type-2 diabetes mellitus, and breast
   cancer [268][65], [269][66], [270][67], [271][68], [272][69],
   [273][70].

4. Discussion

   Functional annotation of lncRNAs in diseases has attracted great
   attention for understanding disease etiology. In this study, we
   proposed a novel strategy termed DAnet, which combines disease
   associations with a cis-regulatory network between lncRNAs and
   neighboring protein-coding genes to improve the functional annotation
   of lncRNAs. The strategy consists of three main steps: (1) identifying
   potential disease-associated lncRNAs based on disease-associated SNPs,
   (2) selecting the more likely disease-associated lncRNAs based on
   expression variability, and (3) constructing cis-regulatory networks
   between the disease-associated lncRNAs and their neighboring
   protein-coding genes. To extend DAnet to other RNA-seq or microarray
   data, the code of DAnet is provided in Supplementary Method S2. DAnet
   is expected to identify the function of specific lncRNAs in a given
   disease.

   First, based on the analysis of 24 datasets involving 16 diseases, the
   Rate of DAnet was overall higher than that of DEA, which indicates
   that DAnet could perform better than traditional differential
   expression-based analysis in identifying experimentally validated
   lncRNAs. In addition, the EF of DAnet was overall higher than that of
   DEA, and all EFs of DAnet were higher than 1. These findings indicate
   the superior capacity of DAnet to control false characterization of
   lncRNA function. Furthermore, during the optimization of K[CV], we
   found that the experimentally verified disease-associated lncRNAs
   generally had higher CV values, which is consistent with previous
   reports [274][16], [275][17], [276][18]. Under the optimal K[CV], the
   optimal CD was not available for six datasets ([277]GSE127853,
   [278]GSE97210, [279]GSE113013, [280]GSE108660, [281]GSE141140,
   TCGA_TC). This may be attributed to the small number of samples and
   the small numbers of lncRNAs/mRNAs in the co-expression analysis
   [282][71]. Finally, the KEGG enrichment results indicate that 40% to
   100% of the biological pathways identified by DAnet in each dataset
   were associated with the corresponding disease. DAnet also identified
   lncRNAs directly associated with most of the studied diseases, as well
   as lncRNAs associated with multiple diseases.

5. Conclusions

   A new strategy integrating disease associations was developed to
   obtain a lower false discovery rate in the functional annotation of
   lncRNAs. The analysis of 24 datasets involving 16 diseases indicated
   that DAnet could perform better than the traditional differential
   expression-based approach in identifying experimentally validated
   lncRNAs, and that most biological pathways identified by DAnet were
   associated with the studied diseases. This provides another way to
   study the function of lncRNAs in diseases. In sum, DAnet is expected
   to identify the function of specific lncRNAs in a given disease.

Contributors

   J.T. and Y.W. conceived the idea and supervised the work. Y.W., J.Z.,
   and X.W. performed the research. Y.W., J.Z., X.W., Adu-Gyamfi E.,
   L.Y., T.L., M.W., Y.D., and F.Z. prepared and analyzed the data. J.T.
   and Y.W. wrote the manuscript. All authors reviewed and approved the
   final version of the manuscript.

CRediT authorship contribution statement

   Yongheng Wang: Formal analysis, Writing – original draft, Writing –
   review & editing, Visualization. Jincheng Zhai: Formal analysis,
   Investigation. Xianglu Wu: Investigation, Visualization. Enoch Appiah
   Adu-Gyamfi: Validation, Writing – review & editing. Lingping Yang:
   Investigation, Validation. Taihang Liu: Validation. Meijiao Wang:
   Validation. Yubin Ding: Project administration. Feng Zhu:
   Conceptualization, Project administration. Yingxiong Wang:
   Conceptualization, Supervision, Funding acquisition. Jing Tang:
   Conceptualization, Writing – original draft, Writing – review &
   editing, Supervision.

Declaration of Competing Interest

   The authors declare that they have no known competing financial
   interests or personal relationships that could have appeared to
   influence the work reported in this paper.

Acknowledgments