Abstract

Background

Despite the unprecedented and increasing amount of data, relatively little progress has been made in molecular characterization of mechanisms underlying Parkinson’s disease. In the area of Parkinson’s research, there is a pressing need to integrate various pieces of information into a meaningful context of presumed disease mechanism(s). Disease ontologies provide a novel means for organizing, integrating, and standardizing the knowledge domains specific to disease in a compact, formalized and computer-readable form and serve as a reference for knowledge exchange or systems modeling of disease mechanism.

Methods

The Parkinson’s disease ontology was built according to the life cycle of ontology building. Structural, functional, and expert evaluation of the ontology was performed to ensure the quality and usability of the ontology. A novelty metric has been introduced to measure the gain of new knowledge using the ontology. Finally, a cause-and-effect model was built around PINK1 and two gene expression studies from the Gene Expression Omnibus database were re-annotated to demonstrate the usability of the ontology.

Results

The Parkinson’s disease ontology with a subclass-based taxonomic hierarchy covers the broad spectrum of major biomedical concepts from molecular to clinical features of the disease, and also reflects different views on disease features held by molecular biologists, clinicians and drug developers. The current version of the ontology contains 632 concepts, which are organized under nine views. The structural evaluation showed the balanced dispersion of concept classes throughout the ontology. The functional evaluation demonstrated that the ontology-driven literature search could gain novel knowledge not present in the reference Parkinson’s knowledge map. The ontology was able to answer specific questions related to Parkinson’s when evaluated by experts.
Finally, the added value of the Parkinson’s disease ontology is demonstrated by ontology-driven modeling of PINK1 and re-annotation of gene expression datasets relevant to Parkinson’s disease.

Conclusions

The Parkinson’s disease ontology delivers the knowledge domain of Parkinson’s disease in a compact, computer-readable form, which can be further edited and enriched by the scientific community and can also be used to construct, represent and automatically extend Parkinson’s-related computable models. A practical version of the Parkinson’s disease ontology for browsing and editing can be publicly accessed at http://bioportal.bioontology.org/ontologies/PDON.

Electronic supplementary material

The online version of this article (doi:10.1186/s12976-015-0017-y) contains supplementary material, which is available to authorized users.

Keywords: Parkinson’s disease, ontology, disease modeling, knowledge engineering

Background

Parkinson’s disease (PD), a progressive movement disorder, is the second most common neurodegenerative disease [1]. The molecular etiology of sporadic PD has not yet been resolved, and PD is therefore often called an “idiopathic” disease. In recent years, several attempts at elucidating the molecular etiology of PD have generated large omics data sets [2]. The emerging systems view on the pathology of neurodegenerative diseases (NDDs) requires an efficient strategy to aggregate, standardize, represent, and communicate biomedical information through controlled vocabularies and ontologies [3]. An ontology is defined as “an explicit specification of a conceptualization”, which aims to facilitate knowledge sharing [4]. Ontologies are the basis for automated reasoning [5], for large-scale annotation of entire genomes [6, 7], for data mining in microarray data [8], for prediction of biomolecular interactions [9], and for semantic and ontological search in poorly structured information sources [10, 11].
A large portfolio of widely accepted and widely used ontologies, including the Gene Ontology [7], the Sequence Ontology [12] and the Microarray Gene Expression Database Ontology [8], has evolved in the life sciences. The Gene Ontology (GO), the most frequently used ontology in the biomedical sciences, provides standard functional annotations for genes and gene products. Although GO has facilitated understanding of high-throughput results by means of enrichment analysis, one of its significant limitations is that it does not capture domain-specific biological complexity [13]. For example, GO is devoid of any disease-specific context. It cannot be used for answering questions like “which disease subtypes or syndromes are over-represented in my gene or protein set?” Hence, a more useful GO ideally should contain: i. disease-specific annotations, ii. disease-specific categories, and iii. semantics that cover disease knowledge domains. To include disease-specific biological processes, functionalities, and categories, disease-specific ontologies that cover a broad spectrum of relevant knowledge are required. Disease ontologies may reference source terminologies and vocabularies with a hierarchical concept classification such as the SNOMED CT nomenclature [14], the ICD ontology [15] and the human disease ontology [16]. These ontologies contain human disease concepts, but their high-level, broad coverage and lack of depth restrict their usage for specific disease domains. Malhotra and colleagues addressed this issue in the area of NDDs by constructing the Alzheimer’s disease ontology (ADO) to cover clinical, etiological, molecular and cellular mechanism aspects of AD [17]. The aim of developing ADO was to enable semantic mining of patient records and literature for effective retrieval and extraction of accurate AD-related information, which could be used for modeling disease processes.
Similarly, there is a need for organizing the knowledge domain of PD. In response to this unmet need, we aimed at creating a disease ontology for PD (PDON) that spans from the molecular biology of the disease to clinical readouts. PDON has been created with a subclass-based hierarchy that, for the majority of concepts, uses the subsumption relation (i.e. is_a). However, based on demand from biomedical experts for richer relations, the partonomy relation (i.e. part_of) was also introduced, as a concession to better usability of the ontology by interdisciplinary experts. We demonstrate the ontology's usability through our use cases, keeping in mind that re-usability by other teams is an important aspect of ontology construction and of its adoption by the community. As a proof of concept, we evaluated the performance of PDON by measuring its ability to recover pre-existing, expert-curated information from the knowledge space of PD in the literature, with the aim of generating novel insights and hypotheses. The power of the ontology can be applied to several scenarios, e.g. building recommendation systems by mapping drug failure events to mechanisms and stages of disease/stages of drug discovery, or distinguishing proven facts from hypotheses and speculations [18]. Furthermore, in the emerging era of systems analysis of NDDs, such a knowledge-driven approach is expected to support the integration of multiscale and multilevel information across different biological scales, from molecular networks to clinical readouts. To this end, we have proposed a model-driven approach to integrating biomedical knowledge and data into mechanistic models that represent cause-and-effect relationships among molecular entities, biological processes and their corresponding clinical outcomes.
Using this strategy in the current study, we demonstrate how PDON can be utilized not only to causally link the molecular etiology of PD to impaired biological processes and their corresponding disease outcome (Application scenario 1) but also to annotate experimental datasets with their corresponding knowledge description for further integration into disease models (Application scenario 2).

Results

The purpose of PDON is to communicate and share PD knowledge in a standard form and to support text-mining and knowledge discovery. Furthermore, for the construction of a large, integrative knowledge base on neurodegenerative diseases, PDON can be used for metadata annotation of various omics data sets available in the public domain. PDON encompasses clinical and non-clinical aspects of PD and is expected to support retrieval of information on syndromes, etiology, treatment, experimental models, diagnosis and symptoms of PD (Fig. 1).

Fig. 1. Upper-level classes of PDON as represented in the Protégé ontology editor software. Super-classes represent different biological views (perspectives) suggested by experts under which PD-specific knowledge is modeled

PDON represents a range of key concepts specific to the knowledge domain of Parkinson’s disease through different views, which have been integrated in the ontology. Views are root super-classes that organize concepts within a certain knowledge domain, as they are realized and seen by experts in reality. PDON is represented by nine views:

The view ‘Clinical aspects’ describes a broad range of motor and non-motor features displayed by PD patients. These features have been classified into three upper-level categories that capture clinical concepts related to “diagnostics”, “symptomatology” and “treatment” of PD.

The view ‘Etiology’ captures both genetic and environmental factors that are known to cause familial PD and parkinsonism due to toxic dopaminergic cell death, respectively.
Toxic and infectious agents as well as genetic mutations are classified within this view; in addition, epidemiologically confirmed factors such as smoking and pesticides have also been included.

The view ‘Model of Parkinson’s disease’ contains various in-vivo and in-vitro disease models that are in routine use in PD research.

The view ‘Neuropathology’ was included to highlight two prototypic hypotheses of PD-related mechanisms, namely synucleinopathy and the emerging tauopathy. It is expected that this view will be populated further and enriched with more neuropathological concepts by the PD research community.

The view ‘Familial neurodegenerative disease’ includes those hereditary disorders that are clinically associated with PD, such as Huntington’s and Wilson’s diseases.

‘Idiopathic Parkinson’s disease’, ‘Primary parkinsonism’, ‘Secondary parkinsonism’, and ‘Parkinson-plus Syndrome’ represent four separate views, as recommended by the clinical expert panel. These views provide a categorized overview of distinct syndromes associated with PD based on their cause. For example, the primary parkinsonism class represents parkinsonian syndromes for which a definite cause has been identified (e.g. mutations in PARK genes), whereas secondary parkinsonism syndromes are induced by a hypothetical cause that is potentially identifiable. Those syndromes with an unknown causative factor have been clinically assigned to the Parkinson-plus view.

In PDON, each concept class is supported by a scientific definition, a valid scientific reference (if available) and existing synonyms (Fig. 2). Definitions have been selected from review papers, journal articles and handbooks with consideration of the consensus definitions accepted in the PD research community. It is noteworthy that PDON is expected to grow over time by inclusion of missing or emerging concepts. Due to dynamic research in the PD field, the structure of the ontology is subject to change.
We explicitly invite experts in the field to critically review, revise and optimize the draft ontology presented in this manuscript. The ontology will be updated based on the feedback collected from experts, which includes editing concepts, re-defining concepts with missing or insufficient explanation, and proposing new relationships. Comments and proposals can be added on the ontology’s webpage in the BioPortal repository. The Bioinformatics group at Fraunhofer Institute SCAI, which owns the ontology, collects this feedback and manages the updated releases. The ontology can be freely accessed and downloaded at http://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads.html.

Fig. 2. A snapshot of the annotation field for PDON concepts as presented in the Protégé ontology editor. Each PDON concept has been annotated with definition, reference, and synonyms

Structural evaluation: PDON was evaluated for structural features reflecting its topology and logical properties. The high-level semantic framework of PDON contains nine super-classes, followed by sub-classes that specifically capture the sub-domain knowledge of PD. PDON was characterized for its structural features using the parameters summarized in Table 1. As shown in Table 1, PDON covers the knowledge domain of PD using 632 concept classes, for which a high number of synonyms has been collected. The depth and width of the ontology reflect sufficient coverage of the PD knowledge domain with a reasonable distribution of concepts at various levels. The so-called Fanout-ness factor represents the distribution of concepts over the entire ontology structure; its comparably high value is indicative of the balanced dispersion of concept classes throughout the ontology, with consistent, broad representation of the knowledge domain across ontology branches.

Table 1.
Summary of the structural parameters and their corresponding values measured for PDON

| Features | No. of classes | No. of synonyms | Max. depth | Depth variance | Avg. width | Fanout-ness |
| PDON | 631 | 505 | 8 | 1.74 | 78.8 | 0.81 |

Functional evaluation and gain of knowledge measurement: The model-based evaluation approach proposed in this work requires that a list of genes and proteins associated with all aspects of PD be captured by PDON and assessed against the PD disease map, a widely accepted reference (see Evaluation section). There is clearly also a need to expand PDON to coding and non-coding RNA, lipids, and eventually non-coding DNA and its modifications. Since the knowledge space of PD is vast, PDON-driven faceted search enables us to distinguish between the core knowledge directly linked to PD pathophysiology and the emerging novel knowledge surrounding PD pathophysiology (e.g. observations through animal models or epidemiological data). For this purpose, separate queries were performed in the SCAIView environment (accessed on 28.04.2014):

1. a query with all the PDON concepts, which resulted in a list of 16333 human genes/proteins; and
2. PDON branch queries, which were formulated as ([PDON Node: “”]) AND [MeSH Disease: “Parkinson Disease”] AND [Free Text: “<THE BRANCH NAME>”]

From the PD map (i.e. the gold standard), 661 human-specific genes/proteins could be extracted, which were used for benchmarking the functional performance of PDON. For this purpose, three branches of PDON (Etiology, Clinical aspects, Neuropathology) were used to query PubMed abstracts in SCAIView as formulated above (see Methods). Manual curation of the total number of retrieved documents per branch (N[TM]) led to the identification of those genes/proteins that are relevant to the context of the searched branch (TM). Table 2 summarizes these results. It shows the calculation of the knowledge gain, i.e. the information obtained by ontology-supported text mining beyond what already exists in the gold standard, computed with the formula described in the Methods section, together with the pathways enriched in the gained knowledge, which characterize its content. Accordingly, these results demonstrate that PDON-assisted search not only retrieved the majority of proteins already embedded in the PD map gold standard but also captured a large portion of the PD knowledge domain that has not been represented in the PD disease map so far (gain of novel knowledge from the literature). The remaining genes/proteins, which were retrieved by these PDON-driven queries but not found in the PD model, represent additional potential knowledge gained from the literature relevant to PD. After expert curation, this new knowledge can be used to extend or enrich the current PD disease map; it is therefore important to measure the added value of the potentially novel knowledge with the metric introduced in the Methods section.

Table 2. Parameter values and the final value of the knowledge gain calculated for three major branches of PDON. TM: number of relevant genes/proteins to the branch by PDON; GS: number of genes/proteins extracted from the PD map as gold standard; N[TM]: total number of genes/proteins for each branch retrieved by PDON. The queries were performed on Human Genes/Proteins and SCAIView returned lists of unique genes specific to each branch.
Numbers represent counts of retrieved genes by SCAIView using PDON

| Knowledge domain branches | TM | TM ∩ GS | N[TM] | Gain of novel knowledge | Enriched pathways in the content of new knowledge |
| Etiology of PD | 173 | 82 | 273 | 33 % | MAPK, Chemokine, Adipocytokine, Neurotrophin, Insulin signaling |
| Clinical aspects of PD | 286 | 97 | 683 | 27 % | GPCR signaling, Neuroactive ligand-receptor interactions, Rhodopsin-like receptors, Peptide ligand-binding receptors, Gastrin-CREB signaling |
| Neuropathology of PD | 252 | 91 | 471 | 34 % | Immune system, Signaling by GPCR, Endocytosis, Toll-like receptor signaling, Hemostasis |

Table 2 lists the parameters of the knowledge gain metric and the corresponding novel pathways for selected views of PDON. The percentage of new knowledge gained by PDON is highest in the neuropathology branch, followed by the etiology and clinical aspects branches. Functional analysis of these additionally identified genes/proteins shows that several statistically significant pathways (in terms of both member proteins and p-values) could be added to the existing PD map.

Expert evaluation: PDON-driven information retrieval and extraction can guide analysis of the literature in answering complex scientific questions. Experts in the knowledge domain of PD were asked to design complex questions, highly relevant to their research work, to be posed to the ontology. We selected two of these questions that contained concepts specific to PDON and benchmarked the performance of PDON against PubMed by retrieving literature abstracts that contained hypotheses answering these competency questions. Table 3 provides an overview of the total number of hits as well as the number of relevant documents that were manually verified to contain a hypothetical answer to the corresponding question. Analyses were performed using both SCAIView and PubMed (see Table 3).
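The percentages reported in Tables 2 and 3 can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' implementation. It assumes that the gain of novel knowledge is computed as (TM − |TM ∩ GS|) / N[TM]; this formula is not stated in this section (the text refers to the Methods section for it) and is inferred here only because it is consistent with the tabulated values. The retrieval efficiency follows the Table 3 column header (% PDON retrieval minus % PubMed retrieval). The function names `knowledge_gain` and `retrieval_efficiency` are hypothetical.

```python
"""Recompute the percentages of Tables 2 and 3 (illustrative sketch only)."""


def knowledge_gain(tm: int, tm_in_gs: int, n_tm: int) -> float:
    """Fraction of retrieved genes/proteins that are relevant but absent
    from the gold standard.

    ASSUMPTION: gain = (TM - |TM ∩ GS|) / N[TM]. The formula is inferred
    from the Table 2 values, not quoted from the article.
    """
    if n_tm <= 0:
        raise ValueError("n_tm must be positive")
    return (tm - tm_in_gs) / n_tm


def retrieval_efficiency(pdon_relevant: int, pdon_total: int,
                         pubmed_relevant: int, pubmed_total: int) -> float:
    """Difference, in percentage points, between the share of relevant
    abstracts retrieved via PDON and the share retrieved via PubMed."""
    pdon_share = pdon_relevant / pdon_total
    pubmed_share = pubmed_relevant / pubmed_total
    return 100.0 * (pdon_share - pubmed_share)


# Table 2 counts: branch -> (TM, TM ∩ GS, N[TM])
TABLE_2 = {
    "Etiology of PD": (173, 82, 273),
    "Clinical aspects of PD": (286, 97, 683),
    "Neuropathology of PD": (252, 91, 471),
}

# Table 3 counts: question -> (PDON relevant, PDON total,
#                              PubMed relevant, PubMed total)
TABLE_3 = {
    1: (20, 70, 20, 95),
    2: (5, 6, 1, 3),
}

for branch, (tm, tm_in_gs, n_tm) in TABLE_2.items():
    print(f"{branch}: gain = {100 * knowledge_gain(tm, tm_in_gs, n_tm):.1f} %")

for question, counts in TABLE_3.items():
    print(f"Question {question}: efficiency = "
          f"{retrieval_efficiency(*counts):.1f} percentage points")
```

Under these assumptions the sketch gives roughly 33.3 %, 27.7 % and 34.2 % for the three branches, and 7.5 and 50.0 percentage points for the two competency questions. The small differences from the tabulated 27 % and 7.7 % presumably reflect rounding in the published figures.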
The following queries were formulated based on the competency questions and were posed to the SCAIView and PubMed retrieval systems:

Table 3. Results of PDON evaluation based on expert questions. For both competency questions, PDON-driven search in SCAIView retrieved fewer abstracts than simple queries in PubMed, but a larger share of them were relevant to the questions (i.e. less noise). This performance efficiency of the PDON-driven search has been calculated in percent, as shown in the last column

| Competency question number | Total no. of abstracts retrieved by PDON in SCAIView | No. of PDON-derived abstracts answering the questions | Total no. of abstracts retrieved by PubMed | No. of PubMed-derived abstracts answering the questions | PDON-driven retrieval efficiency (% PDON retrieval - % PubMed retrieval) |
| 1 | 70 | 20 | 95 | 20 | 28.7 % - 21 % = 7.7 % |
| 2 | 6 | 5 | 3 | 1 | 83.3 % - 33.3 % = 50 % |

Competency question 1. Return all literature references mentioning drugs used