Abstract

Background

   Despite the unprecedented and increasing amount of data, relatively
   little progress has been made in molecular characterization of
   mechanisms underlying Parkinson’s disease. In the area of Parkinson’s
   research, there is a pressing need to integrate various pieces of
   information into a meaningful context of presumed disease mechanism(s).
   Disease ontologies provide a novel means for organizing, integrating,
   and standardizing the knowledge domains specific to disease in a
   compact, formalized and computer-readable form and serve as a reference
   for knowledge exchange or systems modeling of disease mechanism.

Methods

   The Parkinson’s disease ontology was built according to the life cycle
   of ontology building. Structural, functional, and expert evaluation of
   the ontology was performed to ensure the quality and usability of the
   ontology. A novelty metric has been introduced to measure the gain of
   new knowledge using the ontology. Finally, a cause-and-effect model was
   built around PINK1 and two gene expression studies from the Gene
   Expression Omnibus database were re-annotated to demonstrate the
   usability of the ontology.

Results

   The Parkinson’s disease ontology with a subclass-based taxonomic
   hierarchy covers the broad spectrum of major biomedical concepts from
   molecular to clinical features of the disease, and also reflects
   different views on disease features held by molecular biologists,
   clinicians and drug developers. The current version of the ontology
   contains 632 concepts, which are organized under nine views. The
   structural evaluation showed the balanced dispersion of concept classes
   throughout the ontology. The functional evaluation demonstrated that
   the ontology-driven literature search could gain novel knowledge not
   present in the reference Parkinson’s knowledge map. The ontology was
   able to answer specific questions related to Parkinson’s when evaluated
   by experts. Finally, the added value of the Parkinson’s disease
   ontology is demonstrated by ontology-driven modeling of PINK1 and
   re-annotation of gene expression datasets relevant to Parkinson’s
   disease.

Conclusions

   The Parkinson’s disease ontology delivers the knowledge domain of
   Parkinson’s disease in a compact, computer-readable form, which can be
   further edited and enriched by the scientific community and used to
   construct, represent and automatically extend Parkinson’s-related
   computable models. A practical version of the Parkinson’s disease
   ontology for browsing and editing can be publicly accessed at
   http://bioportal.bioontology.org/ontologies/PDON.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12976-015-0017-y)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Parkinson’s disease, ontology, disease modeling, knowledge
   engineering

Background

   Parkinson’s disease (PD), a progressive movement disorder, is the
   second most common neurodegenerative disease [1]. The molecular
   etiology of sporadic PD has not been resolved yet and therefore PD is
   often called an “idiopathic” disease. In recent years, several attempts
   at elucidating the molecular etiology of PD have generated large omics
   data sets [2]. The emerging systems view on the pathology of
   neurodegenerative diseases (NDDs) requires an efficient strategy to
   aggregate, standardize, represent, and communicate biomedical
   information through controlled vocabularies and ontologies [3]. An
   ontology is defined as “an explicit specification of a
   conceptualization”, which aims to facilitate knowledge sharing [4].

   Ontologies are the basis for automated reasoning [5], for
   large-scale annotation of entire genomes [6, 7], for data
   mining in microarray data [8], for prediction of biomolecular
   interactions [9], and for semantic and ontological search in poorly
   structured information sources [10, 11].

   A large portfolio of widely accepted and widely used ontologies,
   including the Gene Ontology [7], the Sequence Ontology [12] and the
   Microarray Gene Expression Database Ontology [8], has evolved in the
   life sciences. The Gene Ontology (GO), which provides standard
   functional annotations for genes and gene products, is the most
   frequently used ontology in the biomedical sciences. Although GO has
   facilitated understanding of high-throughput results by means of
   enrichment analysis, one of its significant limitations is that it
   does not capture domain-specific biological complexity [13]. For
   example, GO is devoid of any disease-specific context. It cannot be
   used for answering questions like “which disease subtypes or syndromes
   are over-represented in my gene or protein set?” Hence, a more useful
   GO would ideally contain: i. disease-specific annotations, ii.
   disease-specific categories, and iii. semantics that cover disease
   knowledge domains.

   To include disease-specific biological processes, functionalities, and
   categories, disease-specific ontologies that cover a broad spectrum of
   relevant knowledge are required. Disease ontologies may reference
   source terminologies and vocabularies with a hierarchical concept
   classification such as the SNOMED CT nomenclature [14], the ICD
   ontology [15] and the human disease ontology [16]. These
   ontologies contain human disease concepts, but their high-level, broad
   coverage and lack of depth restrict their usage for specific disease
   domains. Malhotra and colleagues addressed this issue in the area of
   NDDs by constructing the Alzheimer’s disease ontology (ADO) to cover
   clinical, etiological, molecular and cellular mechanism aspects of
   AD [17]. The aim of developing ADO was to enable semantic mining of
   patient records and literature for effective retrieval and extraction
   of accurate AD-related information, which could be used for modeling
   disease processes.

   Similarly, there is a need for organizing the knowledge domain of PD.
   In response to this unmet need, we aimed at creating a disease ontology
   for PD (PDON) that spans from the molecular biology of the disease to
   clinical readouts. PDON has been created with a subclass-based
   hierarchy that, for the majority of concepts, uses the subsumption
   relation (i.e. is_a). However, based on demand from biomedical experts
   for richer relations, the partonomy relation (i.e. part_of) was also
   introduced, as a concession to better usability of the ontology by
   interdisciplinary experts. We demonstrate the ontology's usability
   through our use cases, keeping in mind that re-usability by other
   teams is an important aspect of ontology construction and of its
   adoption by the community. As a proof of concept, we evaluated the
   performance of PDON by measuring its ability to recover pre-existing,
   expert-curated information from the knowledge space of PD in the
   literature with the aim of generating novel insights and hypotheses.

   The power of the ontology can be applied to several scenarios, e.g.
   building recommendation systems by mapping drug failure events to
   mechanisms and stages of disease/stages of drug discovery, or
   distinguishing proven facts from hypotheses and speculations [18].
   Furthermore, in the emerging era of systems analysis of NDDs, such a
   knowledge-driven approach is expected to support the integration of
   multiscale and multilevel information across different biological
   scales, from molecular networks to clinical readouts.

   To this end, we have proposed a model-driven approach to integrating
   biomedical knowledge and data into mechanistic models that represent
   cause-and-effect relationships among molecular entities, biological
   processes and their corresponding clinical outcomes. Using this
   strategy in the current study, we demonstrate how PDON can be utilized
   not only to causally link molecular etiology of PD to impaired
   biological processes and their corresponding disease outcome
   (Application scenario 1) but also to annotate experimental datasets
   with their corresponding knowledge description for further integration
   into disease models (Application scenario 2).

Results

   The purpose of the PDON is to communicate and share the PD knowledge in
   a standard form and support text-mining and knowledge discovery.
   Furthermore, for the construction of a large, integrative knowledge
   base on neurodegenerative diseases, PDON can be used for metadata
   annotation of various omics data sets available in the public domain.
   The PDON encompasses clinical and non-clinical aspects of PD and is
   expected to support retrieval of information on syndromes, etiology,
   treatment, experimental models, diagnosis and symptoms of PD
   (Fig. 1).

Fig. 1.

   Upper-level classes of PDON as represented in the Protégé ontology
   editor software. Super-classes represent different biological views
   (perspectives) suggested by experts under which PD-specific knowledge
   is modeled

   PDON represents a range of key concepts specific to the knowledge
   domain of Parkinson’s disease through different views, which have been
   integrated in the ontology. Views are root super-classes that organize
   concepts within a certain knowledge domain in the way experts perceive
   them in practice. PDON is represented by nine views:

   The view ‘Clinical aspects’ describes a broad range of motor and
   non-motor features displayed by PD patients. These features have been
   classified into three upper-level categories that capture clinical
   concepts related to “diagnostics”, “symptomatology” and “treatment” of
   PD.

   The view ‘Etiology’ captures both genetic and environmental factors
   that are known to cause familial PD and parkinsonism due to toxic
   dopaminergic cell death, respectively. Toxic and infectious agents as
   well as genetic mutations are classified within this view; in addition,
   epidemiologically confirmed factors such as smoking and pesticides have
   also been included.

   The view ‘Model of Parkinson’s disease’ contains various in-vivo and
   in-vitro disease models that are in routine use in PD research.

   The view ‘Neuropathology’ was included to highlight two prototypic
   hypotheses of PD-related mechanisms, namely synucleinopathy and the
   emerging tauopathy. It is expected that this view is populated further
   and enriched with more neuropathological concepts by the PD research
   community.

   The view ‘Familial neurodegenerative disease’ includes those hereditary
   disorders that are clinically associated with PD, such as Huntington’s
   and Wilson’s diseases.

   ‘Idiopathic Parkinson’s disease’, ‘Primary parkinsonism’, ‘Secondary
   parkinsonism’, and ‘Parkinson-plus Syndrome’ represent four separate
   views, as recommended by the clinical expert panel. These views
   provide a categorized overview of distinct syndromes associated with PD
   based on their cause. For example, the primary parkinsonism class
   represents parkinsonian syndromes for which a definite cause has been
   identified (e.g. mutations in PARK genes), whereas secondary
   parkinsonism syndromes are induced by a hypothetical cause that is
   potentially identifiable. Syndromes with an unknown causative factor
   have been clinically assigned to the Parkinson-plus view.

   In PDON, each concept class is supported by a scientific definition, a
   valid scientific reference (if available) and existing synonyms
   (Fig. 2). Definitions have been selected from review papers,
   journal articles and handbooks with consideration of the consensus
   definitions accepted in the PD research community. PDON is expected to
   grow over time through the inclusion of missing or emerging concepts,
   and because research in the PD field is dynamic, the structure of the
   ontology is subject to change. We explicitly invite experts in the
   field to critically review, revise and optimize the draft ontology
   presented in this manuscript. The ontology will be updated based on
   the feedback collected from experts, which may include editing
   concepts, re-defining concepts with missing or insufficient
   explanation, or proposing new relationships. Feedback can be submitted
   by adding comments or proposals to the ontology’s webpage on the
   BioPortal repository. The Bioinformatics group at Fraunhofer Institute
   SCAI, which owns the ontology, collects this feedback and manages the
   updated releases. The ontology can be freely accessed and downloaded at
   http://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads.html.

Fig. 2.

   A snapshot of the annotation field for PDON concepts as presented in
   the Protégé ontology editor. Each PDON concept has been annotated with
   definition, reference, and synonyms

Structural evaluation:

   PDON was evaluated for structural features reflecting its topology and
   logical properties. The high-level semantic framework of PDON contains
   nine super-classes, followed by sub-classes that specifically capture
   the sub-domain knowledge of PD. PDON was characterized for its
   structural features using the parameters summarized in Table 1. As
   shown in Table 1, PDON covers the knowledge domain of PD using 632
   concept classes for which a high number of synonyms has been
   collected. The depth and width of the ontology reflect sufficient
   coverage of the PD knowledge domain with a reasonable distribution of
   concepts at various levels. The so-called Fanout-ness factor represents
   the distribution of concepts over the entire ontology structure; its
   comparatively high value is indicative of the balanced dispersion of
   concept classes throughout the ontology with consistent, broad
   representation of the knowledge domain across ontology branches.

Table 1.

   Summary of the structural parameters and their corresponding values
   measured for PDON

   Features   No. of classes   No. of synonyms   Max. depth   Depth variance   Avg. width   Fanout-ness
   PDON       631              505               8            1.74             78.8         0.81

Functional evaluation and gain of knowledge measurement:

   The model-based evaluation approach proposed in this work requires that
   a list of genes and proteins associated with all aspects of PD is
   captured by PDON and assessed against the PD disease map as a widely
   accepted reference (see Evaluation section). There is a clear need to
   expand PDON to coding and non-coding RNA, lipids and eventually to
   non-coding DNA and its modifications as well. Since the knowledge
   space of PD is vast, PDON-driven faceted search enables us to
   distinguish between the core knowledge directly linked to PD
   pathophysiology and the emerging novel knowledge surrounding PD
   pathophysiology (e.g. observations through animal models or
   epidemiological data). For this purpose, separate queries were
   performed in the SCAIView environment (accessed on 28.04.2014):
    1. a query with all the PDON concepts, which resulted in a list of
       16333 human genes/proteins; and
    2. PDON branch queries, formulated as ([PDON Node: "<THE BRANCH
       CONCEPT>"]) AND [MeSH Disease: "Parkinson Disease"] AND [Free
       Text: "<THE BRANCH NAME>"]
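
   The branch-query template above can be filled in programmatically. The
   following is a minimal sketch only: the function name
   build_branch_query is hypothetical, the exact quoting and spacing
   accepted by SCAIView are assumed from the template as printed, and
   using the same label for both placeholders is an assumption rather
   than something the text specifies.

```python
def build_branch_query(branch_concept: str, branch_name: str) -> str:
    """Fill the branch-query template quoted in the text.

    Hypothetical helper: the query syntax is copied from the template
    as printed; it has not been checked against SCAIView itself.
    """
    return (
        f'([PDON Node: "{branch_concept}"]) '
        f'AND [MeSH Disease: "Parkinson Disease"] '
        f'AND [Free Text: "{branch_name}"]'
    )


# Assumption: the branch label serves as both node concept and free text.
branches = ["Etiology", "Clinical aspects", "Neuropathology"]
queries = {branch: build_branch_query(branch, branch) for branch in branches}

for query in queries.values():
    print(query)
```

   Keeping the template in one function makes it easy to issue the same
   query shape for every branch of the ontology.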

   From the PD map (i.e. the gold standard), 661 human-specific
   genes/proteins could be extracted, which were used for benchmarking
   the functional performance of PDON. For this purpose, three branches
   of PDON (Etiology, Clinical aspects, Neuropathology) were used to
   query PubMed abstracts in SCAIView as formulated above (see Methods).
   Manual curation of the total number of retrieved documents per branch
   (N[TM]) led to the identification of those genes/proteins that are
   relevant to the context of the searched branch (TM). Table 2
   summarizes these results. It shows the knowledge gain, i.e. the
   information obtained from text mining with the support of the ontology
   in addition to the information already present in the gold standard,
   calculated with the formula described in the Methods section, together
   with the pathways enriched in the gained knowledge, which characterize
   its content. These results demonstrate that PDON-assisted search not
   only retrieved the majority of proteins already embedded in the PD map
   gold standard but also captured a large portion of the PD knowledge
   domain that has not been represented in the PD disease map so far
   (gain of novel knowledge from the literature). The remaining
   genes/proteins that were retrieved by these PDON-driven queries but
   not found in the PD model represent additional potential knowledge
   gained from the literature relevant to PD. After expert curation, this
   new knowledge can be used to extend or enrich the current PD disease
   map; it is therefore important to measure the added value of the
   potentially novel knowledge with the metric introduced in the Methods
   section.

Table 2.

   Parameter values and the final value of the knowledge gain calculated
   for three major branches of PDON. TM: number of genes/proteins
   retrieved by PDON that are relevant to the branch; GS: number of
   genes/proteins extracted from the PD map as gold standard; N[TM]:
   total number of genes/proteins retrieved by PDON for each branch. The
   queries were performed on Human Genes/Proteins and SCAIView returned
   lists of unique genes specific to each branch. Numbers represent
   counts of genes retrieved by SCAIView using PDON

   Knowledge domain branch   TM    TM ∩ GS   N[TM]   Gain of novel knowledge
   Etiology of PD            173   82        273     33 %
   Clinical aspects of PD    286   97        683     27 %
   Neuropathology of PD      252   91        471     34 %

   Enriched pathways in the content of new knowledge:
   Etiology of PD: MAPK, Chemokine, Adipocytokine, Neurotrophin, Insulin
   signaling
   Clinical aspects of PD: GPCR signaling, Neuroactive ligand-receptor
   interactions, Rhodopsin-like receptors, Peptide ligand-binding
   receptors, Gastrin-CREB signaling
   Neuropathology of PD: Immune system, Signaling by GPCR, Endocytosis,
   Toll-like receptor signaling, Hemostasis

   Table 2 lists the parameters of the knowledge gain metric and the
   corresponding novel pathways for selected views of PDON. In
   descending order, the highest percentage of new knowledge is gained
   in the branches representing neuropathology, etiology and clinical
   concepts. Functional analysis of these additionally identified
   genes/proteins shows that several statistically significant pathways
   (in terms of both member proteins and p-values) could be added to the
   existing PD map.
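
   The knowledge gain formula itself is given in the Methods section and
   is not reproduced here. As a hedged illustration, the sketch below
   assumes that the gain is the share of retrieved genes/proteins that
   are relevant to the branch but absent from the gold standard, i.e.
   (TM - |TM ∩ GS|) / N[TM]. This assumed form yields about 33.3 %,
   27.7 % and 34.2 % for the three branches, which agrees with the
   reported 33 %, 27 % and 34 % to within rounding. The class and field
   names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchResult:
    """Counts for one PDON branch, taken from Table 2."""
    name: str
    tm: int         # relevant genes/proteins retrieved via PDON (TM)
    tm_and_gs: int  # of those, already in the PD map gold standard (TM ∩ GS)
    n_tm: int       # all genes/proteins retrieved for the branch (N[TM])

    def gain(self) -> float:
        # Assumed formula, inferred from the table values; the
        # authoritative definition is in the Methods section.
        return (self.tm - self.tm_and_gs) / self.n_tm


rows = [
    BranchResult("Etiology of PD", 173, 82, 273),
    BranchResult("Clinical aspects of PD", 286, 97, 683),
    BranchResult("Neuropathology of PD", 252, 91, 471),
]

for row in rows:
    print(f"{row.name}: {100 * row.gain():.1f} %")
```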

Expert evaluation:

   PDON-driven information retrieval and extraction can guide analysis of
   literature in answering complex scientific questions. Experts in the
   knowledge domain of PD were asked to design complex questions highly
   relevant for their research work to be posed to the ontology. We
   selected two of these questions that contained concepts specific to
   PDON and benchmarked the performance of PDON against PubMed by
   retrieving literature abstracts that contained hypotheses answering
   these competency questions. Table 3 provides an overview of the
   total number of hits as well as the number of relevant documents that
   were manually verified to contain a hypothetical answer to the
   corresponding question. Analyses were performed using both SCAIView
   and PubMed. The following queries were formulated based on the
   competency questions and were posed to the SCAIView and PubMed
   retrieval systems:

Table 3.

   Results of PDON evaluation based on expert questions. For both
   competency questions, a larger share of the abstracts retrieved by
   PDON-driven search in SCAIView answered the question than of those
   retrieved by simple queries in PubMed (i.e. less noise). This
   retrieval efficiency of the PDON-driven search is given in percent in
   the last row

   Competency question number                                    1        2
   Total no. of abstracts retrieved by PDON in SCAIView          70       6
   No. of PDON-derived abstracts answering the question          20       5
   Total no. of abstracts retrieved by PubMed                    95       3
   No. of PubMed-derived abstracts answering the question        20       1
   % PDON retrieval                                              28.7 %   83.3 %
   % PubMed retrieval                                            21 %     33.3 %
   PDON-driven retrieval efficiency
   (% PDON retrieval - % PubMed retrieval)                       7.7 %    50 %
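
   The retrieval efficiency in Table 3 is the difference between two
   proportions: the share of PDON-retrieved abstracts that answer the
   question minus the corresponding share for PubMed. The sketch below
   recomputes it from the counts in the table; the function names are
   hypothetical. Exact division gives about 28.6 % and 7.5 % for
   question 1, slightly below the reported 28.7 % and 7.7 %, which
   presumably reflects rounding in the original calculation.

```python
def relevant_share(relevant: int, retrieved: int) -> float:
    """Percentage of retrieved abstracts that answer the question."""
    return 100.0 * relevant / retrieved


def retrieval_efficiency(pdon_relevant: int, pdon_total: int,
                         pubmed_relevant: int, pubmed_total: int) -> float:
    """% PDON retrieval minus % PubMed retrieval, as in Table 3."""
    return (relevant_share(pdon_relevant, pdon_total)
            - relevant_share(pubmed_relevant, pubmed_total))


# Counts from Table 3:
# (PDON relevant, PDON total, PubMed relevant, PubMed total)
questions = {1: (20, 70, 20, 95), 2: (5, 6, 1, 3)}
efficiency = {q: retrieval_efficiency(*counts) for q, counts in questions.items()}

for q, value in efficiency.items():
    print(f"Competency question {q}: {value:.1f} percentage points")
```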

Competency question 1. Return all literature references mentioning drugs used