Keywords: Long non-coding RNA, Functional prediction, Disease-associated SNPs, Coefficient of variation, WGCNA

Abstract

Long non-coding RNAs (lncRNAs) play critical roles in various biological processes and are associated with many diseases. Functional annotation of lncRNAs in diseases has attracted great attention as a route to understanding disease etiology. However, traditional co-expression-based analysis usually produces a large number of false-positive function assignments. It is thus crucial to develop a new approach that achieves a lower false discovery rate in the functional annotation of lncRNAs. Here, a novel strategy termed DAnet, which combines disease associations with the cis-regulatory network between lncRNAs and neighboring protein-coding genes, was developed, and its performance was systematically compared with that of the traditional differential expression-based approach. Based on a gold-standard analysis of experimentally validated lncRNAs, the proposed strategy performed better in identifying the experimentally validated lncRNAs than the other method. Moreover, the majority of biological pathways (40% to 100%) identified by DAnet were reported to be associated with the studied diseases. In sum, DAnet is expected to be useful for identifying the function of specific lncRNAs in a particular disease or in multiple diseases.

1. Introduction

Long non-coding RNA (lncRNA) is broadly defined as a type of non-coding RNA with a length of more than 200 nucleotides [1]. Extensive evidence has shown that lncRNAs carry out diverse functions in biological processes [2] and are associated with many diseases [3], such as cancers [4], cardiovascular diseases [5], neurodegenerative diseases [6], metabolic diseases [7], and inflammatory diseases [8].
Currently, many computational methods for predicting lncRNA function have been developed [9], for instance, differential expression analysis (DEA) combined with weighted correlation network analysis (WGCNA) [10]. This method has been frequently employed, for example to identify co-regulatory relationships among lncRNAs and mRNAs in polycystic ovary syndrome [11] and to discover the cis-regulatory lncRNAs involved in vascular inflammation [12]. However, analysis based on co-expression usually results in a large number of false-positive function assignments [9]. Moreover, experimentally supported lncRNA-disease association data remain quite limited in the literature [13]. Specifically, only about 6,000 of over 90,000 lncRNAs in the human genome have been experimentally characterized as "disease-associated" [14], [15]. This may be attributed to the complex characteristics of lncRNAs, including their higher expression variability across disease conditions [16], [17], [18], the susceptibility of their expression and secondary structure to genetic variants [19], [20], [21], and the various levels at which they regulate coding genes (cis/trans) [2], [18]. Analyses that incorporate disease specificity into lncRNA functional annotation can improve the discovery of disease-associated lncRNAs [16]. In particular, lncRNA-disease associations can be established via single nucleotide polymorphisms (SNPs), a type of genetic variant, located within lncRNAs [16], and via condition-specific analysis estimated by the coefficient of variation (CV) [17], [22]. Moreover, many lncRNAs have been reported to regulate the expression of their neighboring genes (that is, to act in cis) [23], [24], [25]. The co-expression of cis-regulatory lncRNAs and their neighboring protein-coding genes has led to the discovery of functional lncRNAs in given diseases [26].
It is therefore crucial to develop a new approach that integrates disease associations to obtain a lower false discovery rate (FDR) [16]. In this study, a novel strategy termed DAnet, which combines disease associations with a cis-regulatory network, was developed. In particular, disease-associated SNPs were first integrated to screen for disease-associated lncRNAs. The CV of these lncRNAs was then estimated to assess their condition-specific expression in a specific disease. Moreover, a WGCNA-based co-expression network between lncRNAs and their neighboring protein-coding genes was constructed, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was conducted, to identify the function of the lncRNAs involved. Furthermore, experimentally verified lncRNA-disease associations were curated to evaluate the performance of this newly proposed strategy across 24 datasets involving eight types of disease based on the ICD-11 classification. Overall, the findings of this study can facilitate the discovery of disease-associated lncRNAs and their functions in specific diseases.

2. Methods

2.1. Collection of the benchmark datasets for the analysis

For the functional analysis of lncRNAs in different types of disease, a variety of microarray and RNA-seq data were collected by searching disease names in the Gene Expression Omnibus (GEO) [27] and The Cancer Genome Atlas (TCGA) [28].
We considered several criteria: (1) gene expression profiling was conducted using high-throughput sequencing or an lncRNA microarray for "Homo sapiens"; (2) the dataset consisted of patient and control groups; (3) the raw or normalized data were available; (4) the number of lncRNAs identified by disease-associated SNPs was greater than zero; (5) experimentally validated disease-associated lncRNAs, obtained from five public databases (LncRNAWiki [29], LncRNADisease [14], LncRNA2Target [30], Lnc2Cancer [31], and EVLncRNAs [32]), were available for the disease; and (6) the datasets together covered multiple types of disease based on the ICD-11 classification. In total, 22 benchmark datasets were collected from GEO and two from TCGA, covering 16 diseases divided into eight disease types according to the ICD-11 classification. The lncRNA and mRNA expression matrices obtained from these 24 case-control datasets were used for subsequent analysis. Table 1 lists the disease type (ICD-11 code), dataset ID, number of samples, expression unit, and numbers of lncRNAs and mRNAs for each dataset.

Table 1. Twenty-four datasets of eight disease types were collected for the functional analysis of lncRNAs. The first 22 datasets were collected from GEO and the last two from TCGA. MDD: major depressive disorder; VHD: valvular heart disease; AF-VHD: valvular heart disease with atrial fibrillation; SLE: systemic lupus erythematosus; ALL: acute lymphoblastic leukemia; TPM: Transcripts Per Million; Normalized: DESeq normalized; nRPKM: normalized Reads Per Kilobase of transcript, per Million mapped reads; FPKM: Fragments Per Kilobase of exon per Million; RPKM: Reads Per Kilobase of transcript per Million reads mapped; Normalized signal intensity: quantile normalization using the GeneSpring software.

Type of disease (ICD-11 code) | Dataset ID | No. of samples in the dataset | Expression unit (experiment type) | No. of lncRNAs & mRNAs
8A20 | GSE113524 [72] | 19 Alzheimer disease; 20 healthy controls | TPM (RNA-Seq) | 12,937 lncRNAs & 18,969 mRNAs
8A20 | GSE104704 [73] | 12 Alzheimer disease; 10 healthy controls | Normalized (RNA-Seq) | 2,199 lncRNAs & 17,965 mRNAs
8A20 | GSE125583 [74] | 219 Alzheimer disease; 70 healthy controls | nRPKM (RNA-Seq) | 2,803 lncRNAs & 18,852 mRNAs
6A70 | GSE101521 [75] | 30 MDD; 29 healthy controls | Normalized (RNA-Seq) | 11,109 lncRNAs & 18,754 mRNAs
6A70 | GSE102556 [76] | 26 MDD; 22 healthy controls | FPKM (RNA-Seq) | 12,718 lncRNAs & 18,793 mRNAs
6A20 | GSE112523 [77] | 29 schizophrenia; 28 healthy controls | Reads Count (RNA-Seq) | 12,179 lncRNAs & 18,437 mRNAs
BA41 | GSE65705 [78] | 32 myocardial infarction; 2 healthy controls | RPKM (RNA-Seq) | 1,351 lncRNAs & 17,801 mRNAs
BA41 | GSE127853 [79] | 3 myocardial infarction; 3 healthy controls | FPKM (RNA-Seq) | 503 lncRNAs & 10,216 mRNAs
BD40 | GSE97210 [80] | 3 atherosclerosis; 3 healthy controls | Normalized signal intensity (Microarray) | 10,347 lncRNAs & 18,604 mRNAs
BD40 | GSE120521 [81] | 4 atherosclerosis unstable; 4 atherosclerosis stable | FPKM (RNA-Seq) | 10,343 lncRNAs & 18,381 mRNAs
BC81 | GSE113013 [27] | 5 AF-VHD; 5 VHD | Normalized signal intensity (Microarray) | 10,347 lncRNAs & 18,604 mRNAs
BC81 | GSE108660 [27] | 5 atrial fibrillation; 5 non-atrial fibrillation | Normalized signal intensity (Microarray) | 8,090 lncRNAs & 18,807 mRNAs
CA23 | GSE106388 [82] | 15 mild asthma; 4 healthy controls | Reads Count (RNA-Seq) | 8,036 lncRNAs & 17,244 mRNAs
CA23 | GSE96783 [83] | 21 asthma; 30 healthy controls | Reads Count (RNA-Seq) | 10,451 lncRNAs & 18,324 mRNAs
DD71 | GSE128682 [84] | 14 ulcerative colitis; 16 healthy controls | Reads Count (RNA-Seq) | 1,756 lncRNAs & 17,355 mRNAs
4A40 | GSE131525 [85] | 3 SLE; 3 healthy controls | Reads Count (RNA-Seq) | 6,031 lncRNAs & 16,972 mRNAs
5A10 | GSE131526 [85] | 12 type-1 diabetes; 3 healthy controls | Reads Count (RNA-Seq) | 6,798 lncRNAs & 16,458 mRNAs
5B81 | GSE129398 [86] | 12 obesity; 10 controls | Reads Count (RNA-Seq) | 822 lncRNAs & 14,300 mRNAs
5B81 | GSE145412 [87] | 8 obesity; 8 controls | TPM (RNA-Seq) | 6,896 lncRNAs & 16,595 mRNAs
5A11 | GSE133099 [27] | 6 type-2 diabetes; 6 lean controls | Reads Count (RNA-Seq) | 8,843 lncRNAs & 17,480 mRNAs
2B33 | GSE141140 [88] | 13 ALL; 4 healthy controls | Reads Count (RNA-Seq) | 867 lncRNAs & 16,297 mRNAs
2B91 | GSE144259 [89] | 6 colorectal cancer; 3 healthy controls | FPKM (RNA-Seq) | 3,249 lncRNAs & 18,604 mRNAs
2C6Z | TCGA-BC [28] | 115 breast cancer; 113 healthy controls | FPKM (RNA-Seq) | 14,097 lncRNAs & 19,631 mRNAs
2D10 | TCGA_TC [28] | 510 thyroid cancer; 58 healthy controls | Reads Count (RNA-Seq) | 13,618 lncRNAs & 19,493 mRNAs

2.2. Collection of the SNP-disease association data for the identification of potential disease-associated lncRNAs

The SNP-disease association data were collected and used to identify potential disease-associated lncRNAs. First, we collected the SNPs associated with the 16 diseases, together with their genomic locations, from three well-known sources: GRASP2 [33], the NHGRI-EBI GWAS Catalog [34], and GWASdb [35]. A significance level of p < 5.0 × 10^-8 is widely accepted in genome-wide association studies [34]. However, since many susceptibility loci may show only moderate significance in association analysis, a threshold of p < 1.0 × 10^-3 was applied for collecting the disease-associated SNPs [35]. Then, we downloaded the chromosomal coordinates of lncRNAs from GENCODE (v31, human reference genome hg38) [36] to map the disease-associated SNPs to lncRNA regions. In total, we collected 124,428 associations between 101,360 SNPs and the 16 diseases for further analyses, and 4,435 unique lncRNAs were found to be potentially associated with these diseases. Details on the number of disease-associated SNPs and lncRNAs are shown in Supplementary Table S1.
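The SNP-to-lncRNA mapping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layouts, the function name `map_snps_to_lncrnas`, the inclusive 1-based coordinates, and the toy records are all hypothetical, and the only value taken from the text is the p < 1.0 × 10^-3 threshold. It is not the authors' implementation.

```python
from collections import defaultdict

P_CUTOFF = 1.0e-3  # SNP significance threshold stated in the text


def map_snps_to_lncrnas(snps, lncrnas, p_cutoff=P_CUTOFF):
    """Return {lncRNA id: set of SNP ids} for SNPs located inside lncRNA spans.

    snps:    iterable of (snp_id, chrom, position, p_value)      (assumed layout)
    lncrnas: iterable of (lnc_id, chrom, start, end), inclusive  (assumed layout)
    """
    # Group lncRNA spans by chromosome so each SNP is only compared
    # against spans on its own chromosome.
    spans_by_chrom = defaultdict(list)
    for lnc_id, chrom, start, end in lncrnas:
        spans_by_chrom[chrom].append((start, end, lnc_id))

    hits = defaultdict(set)
    for snp_id, chrom, pos, p_value in snps:
        if p_value >= p_cutoff:
            continue  # keep only SNPs with p below the cutoff
        for start, end, lnc_id in spans_by_chrom.get(chrom, []):
            if start <= pos <= end:
                hits[lnc_id].add(snp_id)
    return dict(hits)


# Hypothetical toy records, not real annotations.
example_lncrnas = [
    ("LNC_A", "chr1", 1000, 5000),
    ("LNC_B", "chr1", 4000, 9000),
    ("LNC_C", "chr2", 100, 200),
]
example_snps = [
    ("rs_1", "chr1", 4500, 1e-5),  # inside LNC_A and LNC_B
    ("rs_2", "chr1", 4500, 1e-2),  # filtered out by the p-value cutoff
    ("rs_3", "chr2", 250, 1e-9),   # significant but outside every span
    ("rs_4", "chr2", 200, 5e-4),   # on the inclusive end of LNC_C
]
example_hits = map_snps_to_lncrnas(example_snps, example_lncrnas)
```

The linear scan per SNP is chosen for readability; with roughly 100,000 SNPs and tens of thousands of lncRNAs, an interval index would be the more practical choice.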
Finally, we extracted the expression levels of these lncRNAs in each dataset from the raw lncRNA expression matrix; the number of lncRNAs extracted on the basis of disease-associated SNPs for each dataset is listed in Table 2.

Table 2. Optimization of K_CV and CD across the datasets. When N_exp was at its maximum, the smallest K_CV/CD attaining that maximum was taken as the optimal value. N_exp: the number of experimentally verified lncRNAs; K_CV: the number of top-ranked lncRNAs with the highest variability; CD: chromosome distance; NA: not available.

Disease name | Dataset ID | No. of lncRNAs in the dataset | No. of lncRNAs based on disease-associated SNPs | No. of experimentally verified lncRNAs | K_CV cutoff | CD cutoff
Alzheimer disease | GSE113524 | 12,937 | 1,680 | 5 | 400 | 400 kb
Alzheimer disease | GSE104704 | 2,199 | 407 | 5 | 200 | 5 kb
Alzheimer disease | GSE125583 | 2,803 | 537 | 5 | 400 | 50 kb
Major depressive disorder | GSE101521 | 11,109 | 1,043 | 2 | 600 | 5 kb
Major depressive disorder | GSE102556 | 12,718 | 1,098 | 2 | 1,000 | 5 kb
Schizophrenia | GSE112523 | 12,179 | 917 | 3 | 300 | 5 kb
Myocardial infarction | GSE65705 | 1,351 | 35 | 2 | 35 | 100 kb
Myocardial infarction | GSE127853 | 503 | 16 | 2 | 16 | NA
Atherosclerosis | GSE97210 | 10,347 | 163 | 1 | 100 | NA
Atherosclerosis | GSE120521 | 10,343 | 120 | 1 | 100 | 5 kb
Atrial fibrillation | GSE113013 | 10,347 | 38 | 1 | 38 | NA
Atrial fibrillation | GSE108660 | 8,090 | 33 | 1 | 33 | NA
Asthma | GSE106388 | 8,036 | 291 | 2 | 200 | 5 kb
Asthma | GSE96783 | 10,451 | 352 | 2 | 100 | 5 kb
Lupus erythematosus | GSE131525 | 6,031 | 64 | 1 | 64 | 5 kb
Ulcerative colitis | GSE128682 | 1,756 | 20 | 1 | 20 | 70 kb
Type-1 diabetes mellitus | GSE131526 | 6,798 | 283 | 3 | 200 | 5 kb
Obesity | GSE129398 | 822 | 46 | 1 | 46 | 5 kb
Obesity | GSE145412 | 6,896 | 197 | 1 | 100 | 5 kb
Type-2 diabetes mellitus | GSE133099 | 8,843 | 1,075 | 5 | 600 | 5 kb
Acute lymphoblastic leukemia | GSE141140 | 867 | 12 | 1 | 12 | NA
Colorectal cancer | GSE144259 | 3,249 | 43 | 6 | 43 | 300 kb
Breast cancer | TCGA_BC | 14,097 | 528 | 12 | 500 | 5 kb
Thyroid cancer | TCGA_TC | 13,618 | 8 | 1 | 8 | NA

2.3.
Detection of the expression variability of lncRNA by condition-specific expression

LncRNAs show higher expression variability in disease than in normal conditions. LncRNAs with relatively high expression variability may have disease-related functions, whereas those with relatively low variability may function in normal conditions [16], [22]. The CV is the standard measure for detecting expression variability [16], [22]. It is defined as "the ratio between the standard deviation of the lncRNA expression levels across the patients and its mean" [22]. In this study, we used this measure to assess the variability of potential disease-associated lncRNAs. The CV was calculated for each lncRNA in the disease samples, and lncRNAs with relatively high CV values were taken to be disease-associated. Finally, we ranked the CV values from high to low and identified the lncRNAs with the top-ranked CV values as the disease-associated ones. Different cutoffs for the number of top-ranked lncRNAs were evaluated in the following optimization procedure. Among the top K_CV lncRNAs (K_CV being the number of top-ranked lncRNAs with the highest variability) in each dataset, the number of experimentally validated lncRNAs (N_exp) was counted. When the number of lncRNAs identified by SNPs (N_snp) was less than 100, K_CV was set equal to N_snp; otherwise, K_CV was varied from 100 to N_snp in steps of 100. When N_exp reached its maximum, the smallest K_CV attaining that maximum was taken as the optimal value.

2.4. Construction of the cis-regulatory network based on lncRNAs' neighboring genes

Co-expressed genes are more likely to be co-regulated and functionally associated, so identifying co-expressed neighboring protein-coding genes can help in assigning lncRNA function [16], [37], [38]. Firstly, we collected the information on all 16,840 lncRNAs and 19,975 protein-coding genes from GENCODE (v31, human reference genome hg38) [36].
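As a brief aside on the CV-based ranking and top-K cutoff search of Section 2.3, the procedure can be sketched in Python as follows. This is a minimal sketch under assumptions, not the authors' code: the function names, the dictionary layout of the expression data, the use of the sample standard deviation, and the choice to exclude N_snp itself from the grid when it is not a multiple of 100 are all hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean across patient samples.

    Uses the sample standard deviation (an assumption; the text does not
    say which estimator was used). Returns 0.0 when the mean is zero.
    """
    m = mean(values)
    if m == 0:
        return 0.0
    return stdev(values) / m


def rank_by_cv(expression):
    """Order lncRNA ids from highest to lowest CV.

    expression: {lnc_id: [expression value per disease sample]} (assumed layout)
    """
    return sorted(
        expression,
        key=lambda lnc_id: coefficient_of_variation(expression[lnc_id]),
        reverse=True,
    )


def candidate_ks(n_snp, step=100):
    """Cutoff grid: n_snp itself if below `step`, else step, 2*step, ... <= n_snp."""
    if n_snp < step:
        return [n_snp]
    return list(range(step, n_snp + 1, step))


def optimal_k(ranked_ids, validated_ids, step=100):
    """Smallest cutoff K that maximizes the count of validated lncRNAs in the top K."""
    best_k, best_n_exp = None, -1
    for k in candidate_ks(len(ranked_ids), step):
        n_exp = len(set(ranked_ids[:k]) & set(validated_ids))
        if n_exp > best_n_exp:  # strict '>' keeps the smallest K on ties
            best_k, best_n_exp = k, n_exp
    return best_k, best_n_exp


# Hypothetical toy data: four lncRNAs measured in four patient samples.
toy_expression = {
    "lnc_a": [1, 1, 1, 1],
    "lnc_b": [1, 9, 1, 9],
    "lnc_c": [4, 5, 4, 5],
    "lnc_d": [0, 10, 0, 10],
}
toy_ranked = rank_by_cv(toy_expression)
toy_best = optimal_k(toy_ranked, {"lnc_b"}, step=2)
```

With a step of 100, `candidate_ks(35)` yields a single cutoff of 35 and `candidate_ks(1098)` yields 100 through 1000, which is consistent with the cutoffs listed in Table 2 for datasets of those sizes.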
After this, we obtained 10 candidate chromosome distances (CDs) from publications on the genomic distance between lncRNAs and the neighboring genes they regulate. These CDs were 5 kb [39], 10 kb [40], 20 kb [41], 50 kb [42], 70 kb [43], 100 kb [44], 200 kb [45], 300 kb [46], 400 kb [47], and 500 kb [12]. Secondly, we identified the neighboring genes within each of these CDs upstream and downstream of every lncRNA, based on the collected location information. This yielded a collection of neighboring genes for the disease-associated lncRNAs identified using SNPs and the optimal K_CV. Thirdly, we constructed the co-expression network between the identified disease-associated lncRNAs and their neighboring genes at each CD for each dataset using WGCNA [10]. An optimization procedure was then performed to determine the optimal CD across the benchmark datasets. Among the lncRNAs co-expressed with neighboring genes, the number of experimentally validated lncRNAs (N_exp) was counted. When N_exp reached its maximum, the smallest CD attaining that maximum was regarded as the optimal one. Finally, for functional prediction, the co-expression network based on the optimal K_CV and CD was constructed with WGCNA for each dataset. The network of the selected module identified by WGCNA was visualized with Cytoscape 3.7.2 (http://www.cytoscape.org/) [48].

2.5. Annotating the lncRNA function based on KEGG pathway

Groups of transcripts identified through clustering need to be subjected to a functional enrichment step to help reveal the biological processes in which these genes are involved [16]. KEGG pathway analysis [49] is widely used for characterizing the function of disease-associated lncRNAs. Herein, we performed KEGG enrichment analyses using the mRNAs that were found to be co-expressed with disease-associated lncRNAs.
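Pathway enrichment of this kind is usually scored as an over-representation test. The sketch below computes a one-sided hypergeometric tail probability in plain Python. It is an illustration only: the function name, the choice of background gene set, and the example counts are assumptions, and the software actually used for the enrichment in this study is not specified here.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """One-sided over-representation p value, P(X >= n_overlap).

    X follows a hypergeometric distribution: n_selected genes are drawn
    without replacement from n_background genes, of which n_pathway
    belong to the pathway of interest.
    """
    if not (0 <= n_pathway <= n_background and 0 <= n_selected <= n_background):
        raise ValueError("pathway and selection sizes must fit in the background")
    largest_overlap = min(n_pathway, n_selected)
    total_draws = comb(n_background, n_selected)
    # Sum the probability mass from the observed overlap up to the maximum.
    # math.comb returns 0 when the lower index exceeds the upper one, so
    # impossible overlaps contribute nothing.
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_selected - i)
        for i in range(n_overlap, largest_overlap + 1)
    )
    return tail / total_draws


# Hypothetical counts: 20 background genes, 5 in the pathway,
# 4 co-expressed mRNAs selected, all 4 of them in the pathway.
example_p = hypergeom_enrichment_p(20, 5, 4, 4)
```

In practice each pathway is tested separately against the same selected gene list, so a multiple-testing correction is often applied on top of the raw p values; whether that was done here is not stated in the text.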
The statistical significance of KEGG pathway enrichment was determined with the hypergeometric test. A p value of less than 0.05 indicated significant enrichment. A chord diagram was also constructed using the R package "circlize" [50] to illustrate the enrichment results.

2.6. Evaluating the ability of DAnet on the function annotation of lncRNA

As a gold standard for verifying the DAnet analysis, 9,949 pairs of experimentally verified lncRNA-disease associations were integrated from five databases: LncRNAWiki [29], LncRNADisease [14], LncRNA2Target [30], Lnc2Cancer [31], and EVLncRNAs [32]. Two metrics, both based on experimentally validated disease-associated lncRNAs, were employed to evaluate the ability of DAnet to characterize the function of disease-associated lncRNAs: (1) the percentage of successful prediction (Rate) and (2) the enrichment factor (EF). The Rate (%) of DAnet and of DEA (Supplementary Method S1) in characterizing the experimentally verified lncRNAs was employed as the first metric. The EF compares the concentration of experimentally verified lncRNAs among the lncRNAs identified by DAnet or DEA with their concentration among all lncRNAs in the expression matrix. False discovery can be effectively evaluated by fully considering the experimentally validated disease-associated lncRNAs [51].
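The two metrics can be sketched in Python as follows. The EF follows the verbal definition above (a ratio of two concentrations). The Rate is written here as the share of experimentally verified lncRNAs that were recovered, which is an assumption, since the text gives no explicit formula for it. The function and parameter names are likewise chosen only for this illustration.

```python
def rate_percent(n_truesuc, n_true):
    """Percentage of experimentally verified lncRNAs that were recovered.

    Assumed reading of 'percentage of successful prediction'.
    """
    if n_true <= 0:
        raise ValueError("n_true must be positive")
    return 100.0 * n_truesuc / n_true


def enrichment_factor(n_truesuc, n_suc, n_true, n_all):
    """Concentration of verified lncRNAs among predictions, relative to
    their concentration among all lncRNAs in the expression matrix.

    n_truesuc: verified lncRNAs among the predicted ones
    n_suc:     lncRNAs predicted as disease-associated
    n_true:    verified lncRNAs for the disease
    n_all:     all lncRNAs in the expression matrix
    """
    if min(n_suc, n_true, n_all) <= 0:
        raise ValueError("n_suc, n_true and n_all must be positive")
    return (n_truesuc / n_suc) / (n_true / n_all)


# Illustrative inputs: 38 SNP-based lncRNAs out of 10,347 (the Table 2
# values for GSE113013), assuming a single verified lncRNA was recovered.
example_ef = enrichment_factor(n_truesuc=1, n_suc=38, n_true=1, n_all=10347)
example_rate = rate_percent(n_truesuc=1, n_true=1)
```

Under these assumed inputs the EF comes out at about 272.3 and the Rate at 100%, which agrees with the values reported for GSE113013 in the Results; this agreement suggests, but does not confirm, that the inputs were read correctly.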
The formula for EF is:

$$\mathrm{EF} = \frac{N_{\mathrm{truesuc}} / N_{\mathrm{suc}}}{N_{\mathrm{true}} / N_{\mathrm{all}}}$$

where N_truesuc denotes the number of experimentally verified lncRNAs successfully characterized as 'disease-associated' by DAnet or DEA; N_suc represents the number of lncRNAs characterized as 'disease-associated' by DAnet or DEA; N_true is the number of experimentally verified lncRNAs in the integrated experimentally verified lncRNA-disease associations; and N_all indicates the total number of lncRNAs in the expression matrix. An EF of no less than 1 indicates that there is enrichment. A larger EF value represents a lower FDR [51].

3. Results

3.1. Identification of disease-specific lncRNA by SNPs across the benchmark datasets

More than 90% of disease-associated SNPs are located in non-coding regions (e.g., lncRNAs). SNPs located in lncRNAs can either modify their secondary structure or affect their expression level [20]. As described in the Methods section, potential disease-associated lncRNAs in the 24 benchmark datasets were identified by disease-associated SNPs for the DAnet analysis. The differentially expressed lncRNAs were regarded as disease-associated lncRNAs for DEA (Supplementary Method S1). Subsequently, the Rate was used as a metric to measure the performance of DAnet and DEA in identifying experimentally verified lncRNAs. As shown in Supplementary Fig. S1, the Rate of each dataset obtained with the adjusted p value (from 0% for 18 datasets to 16.7% for GSE125583) was lower than that obtained with the raw p value (from 0% for 11 datasets to 33.3% for GSE106388). Among the 24 datasets, eight had no differentially expressed genes at an FDR of less than 0.05. Thus, the raw p value (p < 0.05) was used for identifying the differentially expressed lncRNAs across the 24 datasets. As shown in Fig.
1, the Rate of DAnet varied (from 2.6% for TCGA-TC to 100% for GSE113013 and GSE108660), and the Rate of DEA also differed greatly (from 0% for 11 datasets to 33.3% for GSE106388). The Rate of DAnet was generally no less than that of DEA across the 24 benchmark datasets. Moreover, among the 24 benchmark datasets, the two atherosclerosis datasets, GSE97210 and GSE120521, were generated by microarray and RNA-Seq, respectively. We further compared the microarray and RNA-Seq data in terms of the originally detected lncRNAs, the potential disease-associated lncRNAs, and the experimentally validated lncRNAs. As shown in Supplementary Fig. S2, the total numbers of originally detected lncRNAs for GSE97210 and GSE120521 were 10,347 and 10,343, respectively. The number of lncRNAs detected in both GSE97210 and GSE120521 was 6,836 (highlighted in blue and red lines). The numbers of potential disease-associated lncRNAs for GSE97210 and GSE120521 were 163 and 120, respectively, of which 111 were shared (highlighted in green and red lines). In both GSE97210 and GSE120521, the experimentally validated lncRNA (CDKN2B-AS1) was identified via DAnet. These findings indicate that GSE97210 and GSE120521 are consistent in identifying the experimentally validated lncRNA.

Fig. 1. Performance comparison between DAnet and DEA across the 24 benchmark datasets (shown in Table 1) based on the percentage of successful prediction (Rate, %). The Rate measures the characterization of the experimentally verified disease-associated lncRNAs.

Similarly, the EF was employed to assess the ability of DAnet and DEA to control false characterization. As shown in Fig.
2, the EF of DAnet differed greatly (from 2.2 for GSE125583 to 272.3 for GSE113013), and the EF of DEA also varied (from 0.0 for 11 datasets to 9.2 for GSE106388). The EF of DAnet was generally no less than that of DEA for each dataset, and all EFs of DAnet were greater than one.

Fig. 2. Performance comparison between DAnet and DEA across the 24 benchmark datasets (shown in Table 1) based on the enrichment factor (EF). The EF compares the concentration of the experimentally verified lncRNAs among the lncRNAs identified by DAnet or DEA with their concentration among all lncRNAs in the expression matrix.

3.2. Optimizing the K_CV and CD parameters across the benchmark datasets

In order to identify the lncRNAs most likely to be disease-associated, an optimization procedure was performed to determine the optimal K_CV and CD across the benchmark datasets. As shown in Fig. 3, the optimal K_CV (red square) varied across the datasets (from 8 for TCGA-TC to 1,000 for GSE102556), and the CV of experimentally verified disease-associated lncRNAs was generally higher. Table 2 shows the optimal K_CV value for each dataset. Moreover, as shown in Supplementary Fig. S3, the optimal CD (red square) differed across the datasets (from 5 kb for 13 datasets to 400 kb for GSE113524). Table 2 also shows the optimal CD for each dataset. For six datasets (GSE127853, GSE97210, GSE113013, GSE108660, GSE141140, TCGA_TC), the CD was not available.

Fig. 3. Optimization of K_CV across the benchmark datasets. X axis: the number of top-ranked lncRNAs with the highest variability (K_CV); Y axis: the number of experimentally verified lncRNAs (N_exp). When the number of lncRNAs identified by SNPs (N_snp) was less than 100, K_CV was equal to N_snp; otherwise, K_CV ranged from 100 to N_snp in steps of 100.

3.3.
The function of lncRNA in disease characterized by DAnet

3.3.1. KEGG enrichment analysis to characterize lncRNA function

The co-expression network of lncRNAs and neighboring mRNAs was constructed with WGCNA under the optimal K_CV and CD for each dataset. The network of the module containing the most genes with significant correlation was displayed with Cytoscape [48]. Four networks are shown in Fig. 4A-D as examples: the light-yellow square represents the lncRNA and the blue dot the co-expressed mRNA in the cis-lncRNA regulatory networks, and the red edge represents the association between a disease-associated lncRNA and a neighboring mRNA. The other 14 networks are shown in Supplementary Fig. S4. For each dataset, KEGG enrichment analysis was performed to characterize lncRNA function via the co-expressed mRNAs. A chord diagram was drawn to illustrate the significantly enriched pathways across the different datasets (Fig. 4E). In Fig. 4E, the enriched pathways reported to be associated with the disease studied are indicated by blue lines, and other pathways by grey lines. The statistics of disease-related pathways in each dataset are shown in Fig. 4F. The percentage of disease-associated pathways ranged from 40% to 100% across datasets. Detailed descriptions of the relevance between diseases and pathways are provided in Supplementary Table S2.

Fig. 4. The function of lncRNA in disease characterized by DAnet. A-D: co-expression network of the module containing the most genes with significant correlation, constructed by WGCNA for each dataset. A: GSE113524, B: GSE65705, C: GSE131525, D: GSE131526; green square: lncRNA, blue dot: mRNA. E: chord diagram of enriched pathways of 15 benchmark datasets (p < 0.05). F: the statistics of disease-associated pathways.

3.3.2.
Association between lncRNAs identified by DAnet and the specific disease

Finally, the relationships between lncRNAs and diseases were systematically searched by hand. As illustrated in Fig. 5, 41 lncRNAs directly associated with disease were identified for most diseases (blue lines). In particular, 13 lncRNAs were identified for Alzheimer disease (orange square, 8A20), three for major depressive disorder (brown square, 6A70), four for schizophrenia (brown square, 6A20), 12 for myocardial infarction (blue square, BA41), two for atherosclerosis (blue square, BD40), six for asthma (pink square, CA23), one for lupus erythematosus (purple square, 4A40), one for ulcerative colitis (turquoise square, DD71), five for obesity (yellow square, 5B81), six for type-2 diabetes mellitus (yellow square, 5A11), three for colorectal cancer (green square, 2B91), and six for breast cancer (green square, 2C6Z). Detailed descriptions of the relevance between lncRNAs and the specific diseases are provided in Supplementary Table S3.

Fig. 5. Associations between lncRNAs identified by DAnet and the specific disease. The blue lines denote the reported associations between lncRNAs and diseases. The squares represent the types of disease. The dots indicate lncRNAs identified by DAnet. Orange square: diseases of the nervous system; brown square: mental, behavioural or neurodevelopmental disorders; blue square: circulatory system disease; pink square: diseases of the respiratory system; purple square: diseases of the immune system; turquoise square: diseases of the digestive system; yellow square: endocrine, nutritional or metabolic diseases; green square: neoplasms; grey dot: lncRNA not reported in the studied disease; green dot: lncRNA associated with a single disease; red dot: lncRNA associated with multiple diseases.

Meanwhile, as illustrated in Fig. 5, lncRNAs associated with multiple diseases (red dots) were identified.
Specifically, two lncRNAs (LINC-PINT, GAS5) were associated with both Alzheimer disease and type-2 diabetes mellitus [52], [53], [54], [55], [56]; SOX2-OT was associated with Alzheimer disease and asthma [57], [58]; CCDC39 with asthma and schizophrenia [59], [60]; HCP5 with asthma and breast cancer [61], [62]; IFNG-AS1 with asthma and ulcerative colitis [63], [64]; and CDKN2B-AS1 with five diseases, namely Alzheimer disease, myocardial infarction, atherosclerosis, type-2 diabetes mellitus, and breast cancer [65], [66], [67], [68], [69], [70].

4. Discussion

Functional annotation of lncRNAs in diseases has attracted great attention for understanding disease etiology. In this study, we proposed a novel strategy termed DAnet, which combines disease associations with the cis-regulatory network between lncRNAs and neighboring protein-coding genes to improve the functional annotation of lncRNAs. The strategy consists of three main procedures: (1) identifying potential disease-associated lncRNAs based on disease-associated SNPs, (2) detecting the lncRNAs most likely to be disease-associated based on expression variability, and (3) constructing cis-regulatory networks between disease-associated lncRNAs and their neighboring protein-coding genes. To extend DAnet to other RNA-seq or microarray data, the code of DAnet is provided in Supplementary Method S2. DAnet can be expected to identify specific lncRNA functions in a given disease. First, based on the analysis of 24 datasets involving 16 diseases, the Rate of DAnet was overall higher than that of DEA, which indicates that DAnet may perform better than traditional differential expression-based analysis in identifying experimentally validated lncRNAs. In addition, the EF of DAnet was overall higher than that of DEA.
All EFs of DAnet were higher than 1. These findings indicate the superior capacity of DAnet to control false characterization of lncRNA function. Furthermore, during the optimization procedure for determining the optimal K_CV, we found that the experimentally verified disease-associated lncRNAs generally had higher CV values. This finding is consistent with those reported by other investigators [16], [17], [18]. Under the optimal K_CV, the optimal CD was not available for six datasets (GSE127853, GSE97210, GSE113013, GSE108660, GSE141140, TCGA_TC). This may be attributed to the small number of samples and the small numbers of lncRNAs/mRNAs in the co-expression analysis [71]. Finally, the KEGG enrichment results indicate that most biological pathways identified by DAnet were associated with the corresponding disease (from 40% to 100%). With DAnet, lncRNAs directly associated with disease were identified for most diseases. Moreover, lncRNAs associated with multiple diseases were also identified.

5. Conclusions

A new strategy integrating disease associations was developed to obtain a lower false discovery rate in the functional annotation of lncRNAs. The analysis of 24 datasets involving 16 diseases indicated that DAnet may perform better than traditional differential expression-based analysis in identifying experimentally validated lncRNAs, and that most biological pathways identified by DAnet were associated with the studied diseases. This provides another way to study the function of lncRNAs in diseases. In sum, DAnet is expected to identify specific lncRNA functions in a given disease.

Contributors

J.T. and Y.W. conceived the idea and supervised the work. Y.W., J.Z., and X.W. performed the research. Y.W., J.Z., X.W., Adu-Gyamfi E., L.Y., T.L., M.W., Y.D., and F.Z. prepared and analyzed the data. J.T. and Y.W. wrote the manuscript.
All authors reviewed and approved the final version of the manuscript.

CRediT authorship contribution statement

Yongheng Wang: Formal analysis, Writing – original draft, Writing – review & editing, Visualization. Jincheng Zhai: Formal analysis, Investigation. Xianglu Wu: Investigation, Visualization. Enoch Appiah Adu-Gyamfi: Validation, Writing – review & editing. Lingping Yang: Investigation, Validation. Taihang Liu: Validation. Meijiao Wang: Validation. Yubin Ding: Project administration. Feng Zhu: Conceptualization, Project administration. Yingxiong Wang: Conceptualization, Supervision, Funding acquisition. Jing Tang: Conceptualization, Writing – original draft, Writing – review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments