Abstract

With the advent of high-throughput technology, a huge amount of microRNA information has been added to the growing body of knowledge on non-coding RNAs. Here we present the Dietary MicroRNA Database (DMD), the first repository for archiving and analyzing the published and novel microRNAs discovered in dietary resources. Currently, the database includes fifteen types of dietary species, such as apple, grape, cow milk, and cow fat, originating from 9 plant and 5 animal species. The annotation for each entry, a mature microRNA indexed as DM0000*, covers the mature sequence, genome location, hairpin structure of the parental pre-microRNA, cross-species sequence comparison, disease relevance, and experimentally validated gene targets. Furthermore, several functional analyses, including target prediction, pathway enrichment and gene network construction, have been integrated into the system, enabling users to generate functional insights by viewing the functional pathways and building the protein-protein interaction networks associated with each microRNA. Another unique feature of DMD is a feature generator that can calculate a total of 411 descriptive attributes for any given microRNA based on its sequence and structure. DMD should be particularly useful for research groups studying microRNA regulation from a nutrition point of view. The database can be accessed at http://sbbi.unl.edu/dmd/.

Introduction

Empowered by revolutionary sequencing technology, microRNAs have been extensively discovered in various dietary resources, including plants (e.g. rice and tomato) and animal sources (e.g. milk and meats). Given the broad implications of microRNAs in health and disease [1–8], research enthusiasm for the functional impact of exogenous food microRNAs on human cellular phenotypes has soared, which warrants efforts to build related bioinformatics tools and databases.
The Dietary MicroRNA Database (DMD) represents the first repository in this domain for archiving and distributing the food-borne microRNAs published in the literature and in public databases. Several public databases focus on microRNA identification and target prediction and archive validated microRNAs with sequence, structure and interaction information. For example, miRBase (http://www.mirbase.org) records 64,473 microRNAs from 223 species [9], and miRecords [10] hosts 2,705 records of interactions between 644 microRNAs and 1,901 target genes in 9 animal species. Databases such as TargetScan [11], miRanda [12] and miRTarBase [13] provide information on validated gene targets as well as computationally predicted targets. Notably, 60% of human genes are reported to be regulated by microRNAs, which participate in many major cellular processes such as cell growth, differentiation and apoptosis [14, 15]. In addition, microRNA expression data, although limited, are archived in public databases such as GEO [16] and TCGA [17].

However, none of the aforementioned databases covers dietary information, which may represent a new horizon in microRNA research. For example, miRBase has reported 808 microRNAs in bovine, whereas only 243 of them have been found in cow milk [18] and 213 in the fat of cow beef [19]. Likewise, human breast milk contains only 434 of the 2,588 microRNAs known in human [20]. We envision that such diet-specific cohorts will be important for nutritionists and general biologists who investigate the dietary intake of microRNAs and analyze their subsequent regulation in human health and disease.
Compelling evidence supporting our hypothesis includes the following: it has recently been discovered that humans can absorb certain exosomal microRNAs from cow’s milk, e.g. miR-29b and miR-200c, and that endogenous microRNA synthesis does not compensate for dietary deficiency [21]; the biogenesis and function of such exogenous miRNAs are evidently health related [21–24]. However, while the evidence in support of the bioavailability of milk miRNAs is unambiguous, a recent report that mammals can also absorb plant miRNAs (e.g. miR-168a) from rice [25] was met with widespread skepticism [26–29]. Based on this evidence, challenging questions may be raised regarding how humans pick up microRNAs from the diet and what broader roles such exogenous microRNAs play in human disease processes.

To facilitate more advanced research on dietary microRNAs, DMD was developed as the first repository for archiving and analyzing the published microRNAs discovered in dietary plants and animals, such as cow milk, breast milk, grape, beef, pork, apple and banana. For each reported microRNA, various types of information are covered, including sequences, genome locations, hairpin structures of parental pre-microRNAs, disease relevance, and experimentally validated gene targets. We also integrate an analytical pipeline into this platform that includes cross-species sequence comparison, target prediction, gene enrichment analysis and microRNA-mediated gene network construction, which we introduce in the following sections. Compared to other microRNA-related databases, DMD also has a few unique features. For example, a feature generation tool allows users to calculate a comprehensive set of molecular discriminators based on the sequence and structure of any microRNA, whether it is an entry in the database or a sequence uploaded by the user.
These discriminators are considered important features for microRNA identification and microRNA-mRNA interaction prediction, and many current tools employ them in addition to complementary seed sequences, the major motifs used in animal and plant species [11, 30–34]. Based on the targets, one can extract functional pathway information and infer the functional impact of the microRNAs through their gene regulation [35, 36]. In a later section, we use a case study to demonstrate the usefulness of this database.

Materials and Methods

Database Construction and Access

Fig 1 shows the workflow of data collection and analysis with DMD. Through literature and database searches, we compiled reported microRNAs from 15 types of dietary species. For each entry, the basic annotation page includes ten types of information, including mature sequences, genome coordinates, pre-microRNA sequence, hairpin structure, cross-species sequence comparison, disease relevance, and the experimentally validated and predicted gene targets. For entries from public databases, e.g. miRBase, we provide links to the external annotation pages.

Fig 1. DMD construction workflow and the outline of data content.

DMD was created using a MySQL database consisting of 25 tables (S1 Fig). The outline (Fig 1) shows that the database content can be categorized into three areas, namely basics, annotation and analysis. First, many external databases are integrated within DMD to allow for quick viewing and annotation of microRNAs. Second, there are the prediction tools, which allow users to quickly view homologous microRNAs via clustering analysis by CD-HIT [37] and to predict gene targets both in the microRNA's own species and in human. Finally, there is an intensive process to annotate microRNAs by dietary species and tissue.
S1 Fig shows the database design, the table relationships and the indexing patterns among these tables. As the schema shows, all information, from the external databases to the prediction tools and annotation, is heavily interconnected, which allows information within the database to be shared easily and quickly. In addition to the MySQL database, the graph database Neo4j (http://neo4j.org) was used to model protein-protein interactions. The graph database consists of two types of nodes, microRNA nodes and protein nodes, and two types of edges, an undirected protein-protein interaction edge and a directed microRNA-to-protein regulation edge. The information in DMD can be freely accessed at http://sbbi.unl.edu/dmd/. Data submission and download are available through a secure user login system.

Cross-species Sequence Comparison

To assess the sequence conservation of each microRNA during evolution, we conducted sequence alignment and comparison using CD-HIT [37], which groups microRNAs with similar sequences into the same cluster. In this analysis, each cluster represents a collection of microRNAs that share identical or highly similar sequences (with identity higher than 95% of the sequence length) and that may originate from various species, e.g. homologous microRNAs. Within a cluster, the user can view the microRNA name, sequence alignment, and associated gene targets and diseases, with the option of viewing information for diet-only species or for all species.

Querying the Database (Browsing and Searching)

Users can browse microRNAs by species using the browse page. Selecting a species outputs the full list of microRNAs discovered in that dietary species. For example, the microRNAs under “cow milk” and “cow fat” are subsets of all known bovine microRNAs.
In addition to browsing by species, there are three methods of searching the datasets: by “ID” (DMD index number, e.g. DM00001), by “Name” (microRNA name, e.g. bta-miR-29b, or part of the name, e.g. 29b) or by “Sequence” (either mature or pre-microRNA sequences, e.g. ugagguaguagguuguauaguu). The search can also be restricted to mature microRNAs or precursors according to user-defined criteria. All search outputs are initially ordered by their unique DMD identifiers, but the results may be re-sorted by microRNA name or sequence.

Experimental and Predicted Targets

Experimental target information was extracted from miRTarBase [38], which contains 18 species and 51,460 miRNA-target interactions. Several computational tools were included for target prediction, especially for species without validated target information, including MirTarget [39], TargetScan [11] and psRNATarget [40]. Note that a prediction tool may be designed for specific organisms, e.g. human-specific or plant-specific. All microRNA sequences are also subject to target prediction against the human genome, regardless of whether predictions on their own genomes are available.

Functional Analysis Based on MicroRNA Targets

Using the predicted and experimental targets for each microRNA entry, users can choose to run a pathway enrichment analysis on selected targets. 1,955 pathways from KEGG [41] are included. A modified p-value was calculated for each relevant pathway based on Fisher’s exact test of the queried targets against the whole genome. In addition to the pathway enrichment analysis, a protein-protein interaction (PPI) network [42] was employed to visualize the microRNA-mRNA regulation network.
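The enrichment test described above can be illustrated with a minimal sketch. It assumes a plain one-sided Fisher's exact test for over-representation, which is equivalent to the upper tail of the hypergeometric distribution; the text does not specify how the p-value was modified, so no modification is applied here. The function name, its parameters and the toy numbers are hypothetical and are not taken from DMD.

```python
from math import comb


def enrichment_p_value(genome_size, pathway_size, n_targets, n_overlap):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Returns the probability of observing at least `n_overlap` pathway genes
    among `n_targets` genes drawn without replacement from a genome of
    `genome_size` genes, of which `pathway_size` belong to the pathway.
    """
    max_overlap = min(pathway_size, n_targets)
    total = comb(genome_size, n_targets)
    # Sum the hypergeometric probabilities for every overlap >= n_overlap.
    tail = sum(
        comb(pathway_size, k) * comb(genome_size - pathway_size, n_targets - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy example (hypothetical numbers): a 20-gene genome, a 5-gene pathway,
# and 4 queried targets of which 2 fall in the pathway.
p = enrichment_p_value(genome_size=20, pathway_size=5, n_targets=4, n_overlap=2)
print(round(p, 4))  # 0.2487
```

In practice such a p-value would be computed once per pathway and then adjusted for the number of pathways tested; which adjustment DMD uses is not stated here.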
Feature Construction

Because the molecular properties of microRNA sequences and structures are key for target identification, we developed a feature page that allows users to calculate, for any given microRNA sequence, a list of features in two classes: sequence-based features and secondary structure features. For each mature miRNA, sequence-based features were generated on both the mature sequence and the corresponding pre-miRNA sequence, such as the existence of palindromic sequences, the sequence length and the composition of monomers and dimers. Such features have been shown to be discriminants when used for machine learning [18, 31, 32, 43–46]. Secondary structure features were calculated from the pre-miRNA sequences. For example, RNAfold [47] was employed to predict the secondary structure and calculate the Minimum Free Energy (MFE) [48]. Based on the predicted structure of the pre-miRNA, 32 triplet features and 11 base-pair features were calculated, such as A((( (the frequency of 3 paired nucleotides leading with A) and %pairGC (the percentage of paired G-C bases). Additionally, RNAshapes was used to map secondary structures to a tree-like domain of shapes, retaining the adjacency and nesting of structural features but disregarding helix length [49]. STOAT, packaged in the NOBAI web server, was used to compute the Shannon Entropy (Q) and Frobenius Norm (F) [50]. See Table 1 for a complete list of features.

Table 1. List of features available for generation.
Category | Feature Details | Feature Dimensions | Reference
Primary Sequence | Single Nucleotide Frequency | 4 x 3 ^1 | [31, 46]
Primary Sequence | Pairwise Nucleotide Frequency | 16 x 3 ^1 | [31, 32, 43, 46]
Primary Sequence | Triplet Nucleotide Frequency | 64 x 3 ^1 | [31, 43, 46]
Primary Sequence | Quadruplet Nucleotide Frequency | 256 x 3 ^1 | [46]
Primary Sequence | A + U Frequency | 1 x 3 ^1 | [43, 44]
Primary Sequence | G + C Frequency | 1 x 3 ^1 | [31, 32, 43, 44, 46]
Primary Sequence | G + U Frequency | 1 x 3 ^1 | [43, 44, 46]
Primary Sequence | Number of Palindromes in Sequence | 1 x 3 ^1 | [45]
Primary Sequence | Length | 1 x 3 ^1 | [31]
Primary Sequence | Pairs of A-U in Premature microRNA | 1 | [43]
Primary Sequence | Pairs of G-C in Premature microRNA | 1 | [43, 44]
Primary Sequence | Pairs of G-U in Premature microRNA | 1 | [43]
Secondary Structure | Nucleotide to RNAfold ^2 triplet match (A(((, C(.(, G(… etc.) | 32 | [44, 46]
Secondary Structure | Minimum Free Energy, Normalized Minimum Free Energy, Frequency of Minimum Free Energy Structures | 3 | [31, 32, 43, 44]
Secondary Structure | Ensemble Free Energy, Normalized Ensemble Free Energy | 2 | [32, 44]
Secondary Structure | Stem Statistics (Stems, Average Stem Length, Maximum Stem Length, Stem containing AU, Stem containing GC, Stem containing GU) | 6 | [32, 43, 44]
Secondary Structure | Minimum Free Energy Statistics (mfe/G+C frequency, mfe/stems, mfe/unpaired nucleotides, mfe/paired nucleotides, difference in mfe and efe, and ensemble diversity) | 6 | [32, 43, 44]
Secondary Structure | Percentage of sequence composed of pairs | 1 | [46]
Secondary Structure | Frequency of Nucleotides that occur outside of UA, GU, GC pairs | 4 | [46]
Secondary Structure | Predicted shape type probability based on RNAshapes ^3 | 5 | [51]
Secondary Structure | STOAT ^4 statistics (Shannon Entropy, Frobenius Norm, Base-pairing propensity, and mean stem length) | 4 | [32]

^1 These features may be calculated for the premature sequence, mature sequence, and seed region sequence.
^2 RNAfold is an external tool that is run with the -p option to generate the partition function and base pairing probability.
^3 RNAshapes is an external tool that is run with the -t option to specify 5 different shape types.
^4 STOAT is an external tool that is run with the -x 31 option to signify 31 character states and the -v option to produce verbose output that is easier to parse.

Results and Discussion

As a dietary microRNA database, DMD acts as the first repository archiving microRNA sequences and annotations related to dietary species. Currently, 15 dietary species have been curated in DMD, drawn from five animal species (human, chicken, cow, pig and salmon) and nine plant species (soybean, tomato, corn, apple, orange, banana, grape, rice, and wheat). Note that several dietary species may originate from the same biological organism. For example, the bovine microRNAs are organized under different dietary groups, including cow milk and cow fat. Table 2 shows the statistics of the different types of information archived for each species.

Table 2. Statistics of microRNAs and species in DMD.

Types | Species | Mature [9] | Precursor [9] | Exp. Target | Pred. Target [38, 40]
Animal | Human Breastmilk | 434 / 2,588 | 402 / 1,881 | 9,995 | 17,416
Animal | Cow Milk | 243 / 793 | 245 / 808 | 3 | 16,451
Animal | Cow Fat | 205 / 793 | 229 / 808 | 5 | 16,269
Animal | Atlantic Salmon | 498 / 498 | 371 / 371 | - | 17,319
Animal | Chicken | 994 / 994 | 740 / 740 | 19 | 18,075
Animal | Pig | 411 / 411 | 382 / 382 | - | 17,406
Plant | Apple | 203 / 207 | 202 / 206 | - | 13,098
Plant | Banana | 360 / 360 | 180 / 180 | - | 16,537
Plant | Corn | 309 / 321 | 166 / 172 | - | 16,456
Plant | Grape | 108 / 186 | 157 / 163 | - | 12,236
Plant | Orange | 47 / 64 | 57 / 60 | - | 12,655
Plant | Rice | 634 / 713 | 526 / 592 | - | 21,355
Plant | Soybean | 620 / 639 | 554 / 573 | - | 20,261
Plant | Tomato | 40 / 110 | 68 / 77 | - | 10,827
Plant | Wheat | 111 / 119 | 108 / 116 | - | 15,327

The Mature and Precursor columns give the #. of dietary miRNAs / #. of known miRNAs in the organism [38].

Database Content

The information stored in DMD is categorized into Sequences and Annotation.

Sequences: Currently, there are 11,569 microRNA entries in DMD, including 5,217 unique mature sequences and 5,865 unique pre-microRNA sequences.
The total count includes duplicates, because the same microRNA can be present in multiple dietary species. DMD follows the same naming standard for each entry as other databases such as miRBase; e.g. microRNA gene names have the form hsa-miR-200. The prefix signifies the organism, in this case Homo sapiens. Each entry in the database, indexed as DM0000*, represents a mature sequence, together with information on the genomic location and hairpin sequence of the parental pre-microRNA, indexed as DP0000*; the entry carries the corresponding miRBase index if the entries in both databases are the same. Homologous microRNA loci in different species are assigned the same number. Paralogous microRNAs are assigned names with lettered and numbered suffixes, depending on whether the derived mature microRNA is identical in sequence or contains sequence differences. The derived mature microRNAs were previously assigned names of the form dme-miR-100 and dme-miR-100* for the guide and passenger strands, respectively, while hsa-miR-100-5p and hsa-miR-100-3p were assigned to sequences derived from the 5’ and 3’ arms of the hsa-miR-100 hairpin precursor.

Annotation: In addition to general information such as sequence and structure, DMD has also generated features for each of the microRNAs. The feature page provides users with 411 molecular attributes that can be calculated from the microRNA sequences and structures. Table 1 lists all the features included in this study and presents a non-exhaustive list of studies that have reported the use of each feature. Other annotation information covers targets, sequence comparison against other species, and pathway and interaction network analysis based on either experimentally validated or computationally predicted targets. Specifically, the pathway information was compiled from KEGG. PPI information has also been used to visualize the interaction network based on the given targets.
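As an illustration of how such an interaction network can be assembled from the given targets, the sketch below mirrors the graph model described in Materials and Methods (microRNA and protein nodes, directed regulation edges, undirected PPI edges) using plain Python containers. It is not the Neo4j API and not the DMD implementation; the class, its methods and the placeholder identifiers ("miR-X", "P1", ...) are hypothetical.

```python
from collections import defaultdict


class RegulationGraph:
    """Toy in-memory version of a microRNA/protein interaction graph."""

    def __init__(self):
        self.regulates = defaultdict(set)  # microRNA -> proteins (directed)
        self.interacts = defaultdict(set)  # protein <-> protein (undirected)

    def add_regulation(self, mirna, protein):
        """Add a directed microRNA-to-protein regulation edge."""
        self.regulates[mirna].add(protein)

    def add_interaction(self, protein_a, protein_b):
        """Add an undirected protein-protein interaction edge."""
        self.interacts[protein_a].add(protein_b)
        self.interacts[protein_b].add(protein_a)

    def network_for(self, mirna):
        """Return the targets of `mirna` and the PPI edges touching them."""
        targets = set(self.regulates.get(mirna, ()))
        edges = set()
        for target in targets:
            for partner in self.interacts.get(target, ()):
                # frozenset makes (a, b) and (b, a) the same undirected edge.
                edges.add(frozenset((target, partner)))
        return targets, edges


# Placeholder data: one microRNA regulating two proteins, plus two PPI edges.
graph = RegulationGraph()
graph.add_regulation("miR-X", "P1")
graph.add_regulation("miR-X", "P2")
graph.add_interaction("P1", "P3")
graph.add_interaction("P2", "P1")
targets, edges = graph.network_for("miR-X")
```

Storing each undirected edge as a frozenset keeps a PPI edge from being counted twice when both of its endpoints are targets of the same microRNA.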
MicroRNA disease information was extracted from miRCancer and obesity, the Human microRNA Disease Database [52], PhenomiR [53], and PubMed literature search. The browse page allows users to access the microRNAs under each dietary species, while the database metadata can be downloaded as plain text. For each microRNA sequence entry, there are links to other databases providing the primary references that describe its discovery, links to