Abstract

   With the advent of high throughput technology, a huge amount of
   microRNA information has been added to the growing body of knowledge
   for non-coding RNAs. Here we present the Dietary MicroRNA Database
   (DMD), the first repository for archiving and analyzing the published
   and novel microRNAs discovered in dietary resources. Currently there
   are fifteen types of dietary species, such as apple, grape, cow milk,
   and cow fat, included in the database originating from 9 plant and 5
   animal species. Annotation for each entry, a mature microRNA indexed as
   DM0000*, covers information of the mature sequences, genome locations,
   hairpin structures of parental pre-microRNAs, cross-species sequence
   comparison, disease relevance, and the experimentally validated gene
   targets. Furthermore, a few functional analyses including target
   prediction, pathway enrichment and gene network construction have been
   integrated into the system, which enable users to generate functional
   insights through viewing the functional pathways and building
   protein-protein interaction networks associated with each microRNA.
   Another unique feature of DMD is that it provides a feature generator
   where a total of 411 descriptive attributes can be calculated for any
   given microRNAs based on their sequences and structures. DMD would be
   particularly useful for research groups studying microRNA regulation
   from a nutrition point of view. The database can be accessed at
   http://sbbi.unl.edu/dmd/.

Introduction

   Empowered by revolutionary sequencing technology, microRNAs have been
   extensively discovered in various dietary resources including plants
   (e.g. rice and tomato) and animals (e.g. milk and meats). Given the
   broad implications of microRNA in health and disease [1–8],
   research enthusiasm for functional impacts of exogenous food microRNA
   in human cellular phenotypes has soared, which warrants the efforts to
   build related bioinformatics tools and databases. The Dietary MicroRNA
   Database (DMD) represents the first repository in this domain for
   archiving and distributing the published food-borne microRNAs in
   the literature and public databases.

   Several public databases focus on microRNA identification and target
   prediction, archiving validated microRNAs with sequence, structure and
   interaction information. For example, miRBase
   (http://www.mirbase.org) records 64,473 microRNAs from 223 species [9]
   and miRecords [10] hosts 2,705 records of interactions between 644
   microRNAs and 1,901 target genes in 9 animal species. Databases such as
   TargetScan [11], miRanda [12] and miRTarBase [13] provide information
   on validated gene targets as well as computationally predicted
   targets. About 60% of human genes are reported to be regulated by
   microRNAs, which participate in many major cellular processes such as
   cell growth, differentiation and apoptosis [14, 15]. In addition,
   microRNA expression data, although limited, are archived in public
   databases such as GEO [16] and TCGA [17]. However, none of the
   aforementioned databases covers dietary information, which may
   represent a new horizon in microRNA research. For example, miRBase has
   reported 808 microRNAs in bovine, whereas only 243 of them have been
   found in cow milk [18] and 213 in the fat of cow beef [19]. Likewise,
   human breast milk contains only 434 of the 2,588 microRNAs known in
   human [20]. We envision that such diet-specific cohorts will be
   important for nutritionists and general biologists investigating
   dietary microRNA intake and its subsequent regulatory effects on human
   health and disease. Compelling evidence supporting our hypothesis
   includes the following: it was recently discovered that humans can
   absorb certain exosomal microRNAs from cow’s milk, e.g., miR-29b and
   miR-200c, and that endogenous microRNA synthesis does not compensate
   for dietary deficiency [21]; the biogenesis and function of such
   exogenous miRNAs are evidently health related [21–24]. However, while
   the evidence in support of bioavailability of milk miRNAs is
   unambiguous, a recent report that mammals can also absorb plant miRNAs
   (e.g. miR-168a) from rice [25] was met with widespread skepticism
   [26–29]. Based on this evidence, challenging questions arise regarding
   how humans take up microRNAs from the diet and what broader roles such
   exogenous microRNAs play in human disease processes.

   In order to facilitate more advanced research related to dietary
   microRNAs, DMD was developed as the first repository for archiving and
   analyzing the published microRNAs discovered in dietary plants and
   animals, such as cow milk, breast milk, grape, beef, pork, apple and
   banana. For each reported microRNA, various types of
   information have been covered, including sequences, genome locations,
   hairpin structures of parental pre-microRNAs, disease relevance, and
   experimentally validated gene targets. We also integrate an analytical
   pipeline into this platform that includes cross-species sequence
   comparison, target prediction, gene enrichment analysis and
   microRNA-mediated gene network construction, which we will introduce in
   the following sections.

   Compared to other microRNA-related databases, DMD also has a few unique
   features. For example, a feature generation tool allows users to
   calculate a comprehensive set of molecular discriminators based on the
   sequences and structures of any microRNA entry in the database or
   uploaded on their own. These discriminators have been considered as
   important features for microRNA identification and microRNA-mRNA
   interaction prediction and have been employed by many current tools in
   addition to the use of complementary seed sequences as major motifs in
   animal and plant species [11, 30–34]. Based on the targets,
   one can extract the functional pathway information and infer the
   functional impacts of the microRNAs through their gene regulation
   [35, 36]. In a later section, we will use a case study to
   demonstrate the usefulness of this database.

Materials and Methods

Database Construction and Access

   Fig 1 shows the workflow of the data collection and analysis with
   DMD. Through literature and database searches, we compiled the
   reported microRNAs from 15 types of dietary species. For each entry,
   the basic annotation page includes ten types of information, among
   them mature
   sequences, genome coordinates, pre-microRNA sequence, hairpin
   structure, cross-species sequence comparison, disease relevance, and
   the experimentally validated and predicted gene targets. For entries
   from public databases, e.g. miRBase, we have provided links to the
   external annotation pages.

Fig 1. DMD construction workflow and the outline of data content.


   The DMD was created using a MySQL database, consisting of 25 tables
   (S1 Fig). The outline (Fig 1) shows that the database content
   can be categorized into three areas, namely basics, annotation and
   analysis. First, many external databases are integrated within DMD to
   allow for quick viewing and annotation of microRNAs. Second, there are
   the prediction tools, which allow users to quickly view homologous
   microRNAs via the clustering analysis by CD-HIT [37] and predict
   gene targets in their own species and in human. Finally, there is an
   intensive process to annotate microRNAs into dietary species and
   tissues.

   S1 Fig shows the database design, table relationships and indexing
   patterns among these tables. All information, from external databases
   to prediction tools and annotation, is heavily connected, as can be
   seen from the schema. This allows the information within the database
   to be shared easily and quickly.

   In addition to the MySQL database, the graph database Neo4j
   (http://neo4j.org) was used to model protein-protein interactions.
   The graph database consists of two different types of nodes, a microRNA
   and a protein node, and two types of edges, a protein-protein
   interaction undirected edge, and a microRNA to protein regulation
   directed edge.
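
   The graph model described above can be illustrated with a small
   in-memory stand-in. This is only a sketch of the two node types and
   two edge types, not the Neo4j schema used by DMD; the class name
   RegulationGraph, its methods, and the identifiers P1 to P4 are
   hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RegulationGraph:
    """Toy stand-in for a graph with microRNA and protein nodes."""
    mirnas: set = field(default_factory=set)
    proteins: set = field(default_factory=set)
    # Undirected protein-protein edges, stored as frozensets so that
    # (a, b) and (b, a) are the same edge.
    ppi: set = field(default_factory=set)
    # Directed microRNA -> protein regulation edges, stored as tuples.
    regulates: set = field(default_factory=set)

    def add_ppi(self, a, b):
        self.proteins.update((a, b))
        self.ppi.add(frozenset((a, b)))

    def add_regulation(self, mirna, protein):
        self.mirnas.add(mirna)
        self.proteins.add(protein)
        self.regulates.add((mirna, protein))

    def targets(self, mirna):
        """Proteins directly regulated by the given microRNA."""
        return {p for m, p in self.regulates if m == mirna}

    def network(self, mirna):
        """PPI edges that touch at least one target of the microRNA."""
        hit = self.targets(mirna)
        return {edge for edge in self.ppi if edge & hit}


# Hypothetical example data (P1..P4 are placeholder protein names).
g = RegulationGraph()
g.add_regulation("bta-miR-29b", "P1")
g.add_ppi("P1", "P2")
g.add_ppi("P2", "P1")  # same undirected edge, stored once
g.add_ppi("P3", "P4")
```

   Storing the undirected edges as frozensets keeps each interaction
   unique regardless of the order in which its endpoints are given,
   while the directed regulation edges keep their orientation.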

   DMD can be freely accessed at http://sbbi.unl.edu/dmd/. Data
   submission and download are available through a secure user login
   system.

Cross-species Sequence Comparison

   In order to assess the sequence conservation of each microRNA during
   evolution, we conducted sequence alignment and comparison using CD-HIT
   [37], which groups microRNAs with similar sequences into the same
   cluster. In this analysis, each cluster represents a collection of
   microRNAs that share identical or highly similar sequences (with
   identity higher than 95% of the sequence length) and that could
   originate from various species, e.g. homologous microRNAs. Within
   the cluster, the user will be able to view the microRNA name, sequence
   alignment, associated gene targets and diseases, along with the option
   of viewing information among diet-only or all species.
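
   To make the idea of identity-based clustering concrete, the sketch
   below groups sequences greedily at a 95% identity threshold. It is a
   simplified illustration and not CD-HIT itself: the identity measure
   here is an ungapped, left-aligned comparison relative to the shorter
   sequence, which is an assumption made for brevity, and the function
   names and the toy sequence names a, b and c are hypothetical.

```python
def identity(a, b):
    """Ungapped, left-aligned identity relative to the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold=0.95):
    """Greedy incremental clustering.

    seqs maps a name to a sequence. Sequences are visited from longest
    to shortest; each joins the first cluster whose representative (the
    first member) is similar enough, otherwise it starts a new cluster.
    """
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    clusters = []
    for name in order:
        for cluster in clusters:
            if identity(seqs[cluster[0]], seqs[name]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Toy input: two identical sequences and one unrelated sequence.
example = {
    "a": "ugagguaguagguuguauaguu",
    "b": "ugagguaguagguuguauaguu",
    "c": "uagcaccauuugaaaucaguguu",
}
clusters = greedy_cluster(example)
```

   In this toy input the two identical sequences end up in one cluster
   and the unrelated sequence forms its own cluster.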

Querying the Database (Browsing and Searching)

   Users are able to browse microRNAs by species using the browse page.
   Accessing each species will output a whole list of microRNAs
   specifically discovered in that dietary species. For example, microRNAs
   under “cow milk” and “cow fat” are subsets of all known bovine
   microRNAs. In addition to browsing by species, there are three
   methods of searching the datasets: by “ID” (DMD index number,
   e.g. DM00001), “Name” (microRNA name, e.g. bta-miR-29b or part of the
   name, e.g. 29b) or “Sequence” (either mature or pre-microRNA sequences,
   e.g. ugagguaguagguuguauaguu). The search can also be constrained
   to mature microRNAs or precursors according to user-defined
   criteria. All outputs from the search are organized initially by their
   unique DMD identifiers, but the results may be re-sorted by microRNA
   name, or sequence.
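
   The three search modes can be sketched as follows. This is a minimal
   illustration over an in-memory list, not the DMD query code: the
   record layout, the function name search, the substring matching for
   sequences, and the toy records (including the pairing of identifiers
   with names and the placeholder name example-miR-x) are all
   assumptions.

```python
def search(records, field, query):
    """Search toy records by 'id', 'name' or 'sequence'.

    Results are returned sorted by identifier, mirroring the default
    ordering described in the text.
    """
    q = query.lower()
    if field == "id":
        hits = [r for r in records if r["id"].lower() == q]
    elif field == "name":
        # Partial names such as "29b" are matched as substrings.
        hits = [r for r in records if q in r["name"].lower()]
    elif field == "sequence":
        # Assumption: DNA-style input is normalised to RNA letters.
        q = q.replace("t", "u")
        hits = [r for r in records if q in r["sequence"].lower()]
    else:
        raise ValueError(f"unknown search field: {field}")
    return sorted(hits, key=lambda r: r["id"])


# Hypothetical records; identifiers and pairings are placeholders.
toy_records = [
    {"id": "DM00002", "name": "bta-miR-29b", "sequence": "uagcaccauuug"},
    {"id": "DM00001", "name": "example-miR-x",
     "sequence": "ugagguaguagguuguauaguu"},
]
by_name = search(toy_records, "name", "29b")
by_seq = search(toy_records, "sequence", "TGAGGTAG")
```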

Experimental and Predicted Targets

   Experimental target information was extracted from miRTarBase
   [38], which contains 18 species and 51,460 miRNA-target
   interactions. A few computational tools were included for target
   prediction, especially for species without validated target
   information, including MirTarget [39], TargetScan [11] and
   psRNATarget [40]. Please note that a prediction tool may be designed
   specifically for certain organisms, e.g. human specific or plant
   specific. All microRNA sequences are also subjected to target
   prediction against the human genome, regardless of whether
   predictions on their own genomes are available.

Functional Analysis Based on MicroRNA Targets

   Based on the predicted and experimental targets for each microRNA
   entry, users can choose to run a pathway enrichment analysis on
   selected targets. A total of 1,955 pathways from KEGG [41] are
   included. A modified p-value was calculated for each relevant pathway
   based on Fisher’s exact test on the queried targets against the whole
   genome. In addition to the pathway enrichment analysis, a
   protein-protein interaction (PPI) network [42] was employed to
   visualize the microRNA-mRNA regulation network.
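
   As an illustration of the enrichment test, the one-sided Fisher’s
   exact test for over-representation is equivalent to the upper tail of
   a hypergeometric distribution. The sketch below computes that
   unadjusted p-value only; the modification applied by DMD is not
   specified here and is not reproduced. The function name enrichment_p
   and its argument names are hypothetical.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric p-value, P(X >= k).

    N: number of genes in the background (whole genome)
    K: number of background genes annotated to the pathway
    n: number of queried target genes
    k: number of queried targets annotated to the pathway
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total


# Toy numbers: all 5 queried genes fall in a 5-gene pathway out of 10.
p_example = enrichment_p(5, 5, 5, 10)
```

   With these toy numbers only one of the 252 possible draws of five
   genes hits the pathway five times, so the p-value is 1/252.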

Feature Construction

   As molecular properties of microRNA sequences and structures are key
   for target identification, we have developed a feature page to allow
   users to calculate for any given microRNA sequences a list of features
   categorized into two classes: sequence-based features and secondary
   structure features. Particularly, for each mature miRNA, features were
   generated on both mature sequences and the corresponding pre-miRNA
   sequences, such as existence of palindromic sequences, sequence length
   and the composition of monomers and dimers. Such features have been
   shown to be discriminative when used for machine learning [18, 31, 32,
   43–46].
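
   A minimal sketch of the monomer and dimer composition features
   follows. It is not the DMD feature generator; the function names and
   the normalisation by the number of counted windows are assumptions.

```python
from itertools import product

ALPHABET = "ACGU"


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over the ACGU alphabet.

    Returns 4**k values (4 for monomers, 16 for dimers). Windows that
    contain letters outside ACGU are skipped.
    """
    seq = seq.upper().replace("T", "U")
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for i in range(max(len(seq) - k + 1, 0)):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: (c / total if total else 0.0)
            for kmer, c in counts.items()}


def gc_content(seq):
    """G + C frequency, derived from the monomer frequencies."""
    freq = kmer_frequencies(seq, 1)
    return freq["G"] + freq["C"]


mono = kmer_frequencies("ugagguaguagguuguauaguu", 1)
di = kmer_frequencies("ugagguaguagguuguauaguu", 2)
```

   The same function yields the triplet and quadruplet frequencies by
   setting k to 3 or 4.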

   Secondary structure features were calculated based on the pre-miRNA
   sequences. For example, RNAfold [47] was employed to predict
   secondary structure and calculate Minimum Free Energy (MFE) [48].
   Based on the predicted structure of pre-miRNA, 32 triplet features and
   11 base-paired features were calculated, such as A((( (the frequency of
   3 paired nucleotides leading with A) and %pairGC (percentage of the
   paired G-C bases). Additionally, RNAshapes was used to map secondary
   structures to a tree-like domain of shapes, retaining adjacency and
   nesting of structural features, but disregarding helix length [49].
   STOAT, packaged in the NOBAI web server, was utilized to compute
   Shannon Entropy (Q) and Frobenius Norm (F) [50]. See Table 1
   for a complete list of features.
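
   The triplet and %pairGC features can be sketched from a sequence and
   its dot-bracket structure, such as the one RNAfold predicts. The code
   below does not call RNAfold and is not the DMD implementation. It
   assumes that the nucleotide of a triplet is the middle base of the
   three positions, that closing brackets are treated like opening ones,
   and that %pairGC is the share of base pairs that are G-C; the text
   does not state these conventions, and all function names are
   hypothetical.

```python
from itertools import product


def base_pairs(structure):
    """Return (i, j) index pairs from a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def triplet_features(seq, structure):
    """Frequencies of the 32 nucleotide + 3-position structure patterns."""
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    flat = structure.replace(")", "(")  # paired vs unpaired only
    counts = {n + "".join(s): 0
              for n in "ACGU" for s in product("(.", repeat=3)}
    for i in range(1, len(seq) - 1):
        key = seq[i] + flat[i - 1:i + 2]
        if key in counts:
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def percent_pair_gc(seq, structure):
    """Percentage of base pairs that are G-C pairs."""
    seq = seq.upper().replace("T", "U")
    pairs = base_pairs(structure)
    if not pairs:
        return 0.0
    gc = sum({seq[i], seq[j]} == {"G", "C"} for i, j in pairs)
    return 100.0 * gc / len(pairs)


# Toy hairpin: three G-C pairs closing a three-base loop.
toy = triplet_features("GGGAAACCC", "(((...)))")
toy_gc = percent_pair_gc("GGGAAACCC", "(((...)))")
```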

Table 1. List of features available for generation.

   | Category | Feature Details | Feature Dimensions | Reference |
   |---|---|---|---|
   | Primary Sequence | Single Nucleotide Frequency | 4 x 3 ^1 | [31, 46] |
   | | Pairwise Nucleotide Frequency | 16 x 3 ^1 | [31, 32, 43, 46] |
   | | Triplet Nucleotide Frequency | 64 x 3 ^1 | [31, 43, 46] |
   | | Quadruplet Nucleotide Frequency | 256 x 3 ^1 | [46] |
   | | A + U Frequency | 1 x 3 ^1 | [43, 44] |
   | | G + C Frequency | 1 x 3 ^1 | [31, 32, 43, 44, 46] |
   | | G + U Frequency | 1 x 3 ^1 | [43, 44, 46] |
   | | Number of Palindromes in Sequence | 1 x 3 ^1 | [45] |
   | | Length | 1 x 3 ^1 | [31] |
   | | Pairs of A-U in Premature microRNA | 1 | [43] |
   | | Pairs of G-C in Premature microRNA | 1 | [43, 44] |
   | | Pairs of G-U in Premature microRNA | 1 | [43] |
   | Secondary Structure | Nucleotide to RNAfold ^2 triplet match (A(((, C(.(, G(…, etc.) | 32 | [44, 46] |
   | | Minimum Free Energy, Normalized Minimum Free Energy, Frequency of Minimum Free Energy Structures | 3 | [31, 32, 43, 44] |
   | | Ensemble Free Energy, Normalized Ensemble Free Energy | 2 | [32, 44] |
   | | Stem Statistics (Stems, Average Stem Length, Maximum Stem Length, Stem containing AU, Stem containing GC, Stem containing GU) | 6 | [32, 43, 44] |
   | | Minimum Free Energy Statistics (mfe/G+C frequency, mfe/stems, mfe/unpaired nucleotides, mfe/paired nucleotides, difference in mfe and efe, and ensemble diversity) | 6 | [32, 43, 44] |
   | | Percentage of sequence composing of pairs | 1 | [46] |
   | | Frequency of Nucleotides that occur outside of UA, GU, GC pairs | 4 | [46] |
   | | Predicted shape type probability based on RNAshapes ^3 | 5 | [51] |
   | | STOAT ^4 statistics (Shannon Entropy, Frobenius Norm, Base-pairing propensity, and mean stem length) | 4 | [32] |

   ^1 These features may be calculated for the premature sequence, mature
   sequence, and seed region sequence.

   ^2 RNAfold is an external tool that is run with the -p option to
   generate the partition function and base pairing probability.

   ^3 RNAshapes is an external tool that is run with the -t option to
   specify 5 different shape types.

   ^4 STOAT is an external tool that is run with the -x 31 option to
   signify 31 character states and the -v option to produce verbose
   output that is easier to parse.

Results and Discussion

   As a dietary microRNA database, DMD acts as the first repository
   archiving microRNA sequences and annotations related to dietary
   species. Currently, 15 dietary species have been curated in DMD,
   covering five animal species (human, chicken, cow, pig and salmon) and
   nine plant species (soybean, tomato, corn, apple, orange, banana,
   grape, rice, and wheat). Please note that several dietary species may
   originate from the same biological organism. For example,
   the bovine microRNAs are organized under different dietary groups,
   including cow milk and cow fat. Table 2 shows the statistics of
   different types of information archived for each species.

Table 2. Statistics of microRNAs and species in DMD.

   Mature and Precursor columns: no. of dietary miRNAs / no. of known
   miRNAs in the organism [38].

   | Types | Species | Mature [9] | Precursor [9] | Exp. Target | Pred. Target [38, 40] |
   |---|---|---|---|---|---|
   | Animal | Human Breastmilk | 434 / 2,588 | 402 / 1,881 | 9,995 | 17,416 |
   | | Cow Milk | 243 / 793 | 245 / 808 | 3 | 16,451 |
   | | Cow Fat | 205 / 793 | 229 / 808 | 5 | 16,269 |
   | | Atlantic Salmon | 498 / 498 | 371 / 371 | - | 17,319 |
   | | Chicken | 994 / 994 | 740 / 740 | 19 | 18,075 |
   | | Pig | 411 / 411 | 382 / 382 | - | 17,406 |
   | Plant | Apple | 203 / 207 | 202 / 206 | - | 13,098 |
   | | Banana | 360 / 360 | 180 / 180 | - | 16,537 |
   | | Corn | 309 / 321 | 166 / 172 | - | 16,456 |
   | | Grape | 108 / 186 | 157 / 163 | - | 12,236 |
   | | Orange | 47 / 64 | 57 / 60 | - | 12,655 |
   | | Rice | 634 / 713 | 526 / 592 | - | 21,355 |
   | | Soybean | 620 / 639 | 554 / 573 | - | 20,261 |
   | | Tomato | 40 / 110 | 68 / 77 | - | 10,827 |
   | | Wheat | 111 / 119 | 108 / 116 | - | 15,327 |

Database Content

   The information stored in DMD is categorized into Sequences and
   Annotation.

   Sequences: Currently, there are 11,569 microRNA entries in DMD,
   including 5,217 unique mature sequences and 5,865 unique pre-microRNA
   sequences. The total includes duplicates because the same microRNA can
   be present in multiple dietary species. DMD follows the same naming
   standard for each entry, e.g. microRNA gene names have the form
   hsa-miR-200, consistent with other databases, e.g. miRBase. The
   prefix signifies the organism, in this case Homo sapiens. Each entry in
   the database, indexed as DM0000*, represents the mature sequence, with
   the information on the genomic location and hairpin sequence of the
   parental pre-microRNA, indexed as DP0000*, which will have the
   corresponding miRBase index if the entries in both databases are the
   same.
   Homologous microRNA loci in different species are assigned the same
   number. Paralogous microRNAs are assigned names with lettered and
   numbered suffixes, depending on whether the derived mature microRNA is
   identical in sequence, or contains sequence differences. The derived
   mature microRNAs were previously assigned names of the form dme-miR-100
   and dme-miR-100*, for the guide and passenger strand, respectively,
   while hsa-miR-100-5p and hsa-miR-100-3p were assigned for sequences
   derived from the 5’ and 3’ arms of the hsa-miR-100 hairpin precursor.

   Annotation: In addition to the general information such as sequence and
   structure, DMD has also generated features for each of the microRNAs.
   The feature page provides users with 411 molecular attributes that can
   be calculated based on the microRNA sequences and structures.
   Table 1 lists all the features included in this study, and also
   presents a non-exhaustive list of studies that have reported use of a
   particular feature. Other annotation information covers targets,
   sequence comparison against other species, and pathway and interaction
   network analysis based on either the experimentally validated targets
   or the computationally predicted targets. Specifically, the pathway
   information was compiled from KEGG. PPI information has also been used
   to visualize the interaction network based on the given targets.
   Furthermore, microRNA disease information was extracted from miRCancer
   and obesity, the Human microRNA Disease Database [52], PhenomiR
   [53], and PubMed literature search.

   The browse page allows users to access microRNAs under each dietary
   species while the database metadata can be downloaded as plain text.
   For each microRNA sequence entry, there are links to other databases
   providing the primary references that describe its discovery, links to