Abstract

   PolyCystic Ovary Syndrome KnowledgeBase (PCOSKB[R2]) is a manually
   curated database with information on 533 genes, 145 SNPs, 29 miRNAs,
   1,150 pathways, and 1,237 diseases associated with PCOS. This data has
   been retrieved based on evidence gleaned by critically reviewing
   literature and related records available for PCOS in databases such as
   KEGG, DisGeNET, OMIM, GO, Reactome, STRING, and dbSNP. Since PCOS is
   associated with multiple genes and comorbidities, data mining
   algorithms for comorbidity prediction and identification of enriched
   pathways and hub genes are integrated in PCOSKB[R2], making it an ideal
   research platform for PCOS. PCOSKB[R2] is freely accessible at
   [34]http://www.pcoskb.bicnirrh.res.in/.

   Subject terms: Databases, Reproductive disorders, Gene regulatory
   networks

Introduction

   Polycystic ovary syndrome (PCOS) is the most common endocrine disorder
   in women of reproductive age^[35]1. The syndrome encompasses a broad
   spectrum of signs and symptoms, making the diagnosis of PCOS
   challenging. Several society-based guidelines exist for PCOS
   diagnosis, such as (i) the Rotterdam criteria accepted by the European
   Society of Human Reproduction and Embryology (ESHRE) and the American
   Society for Reproductive Medicine (ASRM)^[36]2; (ii) the National
   Institutes of Health/National Institute of Child Health and Human
   Development (NIH/NICHD) criteria^[37]3; and (iii) the Androgen Excess and
   PCOS Society (AE-PCOS/AES) criteria^[38]4. These guidelines rely on the
   presence of oligo-anovulation and hyperandrogenism, after excluding
   other androgen excess or related disorders, for diagnosis of PCOS. The
   prevalence of PCOS globally ranges from 2.2 to 26% contingent upon the
   population assessed and the criteria used for evaluation^[39]5. Many of
   the women with PCOS suffer from various comorbid conditions such as
   glucose intolerance^[40]6, type-II diabetes^[41]7, cardiovascular
   ailments^[42]8, anxiety disorders^[43]9, bipolar disorders^[44]10 and
   sleep-related disorders^[45]11.

   The increasing prevalence of PCOS and its profound impact on the
   physical and mental health of women has catapulted research efforts to
   elucidate the genetic etiology and pathophysiology of PCOS^[46]12.
   This, in turn, has led to a surge in PCOS-related data available in the
   public domain, creating an urgent need to manually curate and
   collate this information in online databases for researchers and
   clinicians.

   The databases dedicated to PCOS that are currently available online are
   PCOSKB^[47]13 and PCOSBase^[48]14. At the time of writing, PCOSDB^[49]15
   is not accessible. PCOSBase, categorized as a manually curated database,
   lists 8,185 proteins as associated with PCOS. This data is a compilation
   from 9 databases and 30 published expression studies, without
   stringent criteria for cataloguing a protein as “PCOS-related”. PCOSKB,
   developed by our group in 2015, was created by critically reviewing the
   scientific literature available for PCOS. The manual curation exercise
   resulted in a list of 241 genes, which was further linked with relevant
   molecular, biochemical, and clinical data along with supporting
   reference literature.

   Over the past 5 years, there has been a significant increase in the
   data available on PCOS. Here, we present an update to the content and
   functionality of the PCOSKB database. PCOSKB[R2] holds information on
   533 genes and 29 miRNAs (manually curated) identified from
   peer-reviewed literature, based on experiments such as RT-PCR, western
   blotting, immunochemistry, and cell-based assays. Additionally,
   information on 4,023 genes identified from microarray expression
   studies on PCOS is also included in PCOSKB[R2]. The PCOS genes are
   further linked with gene ontology terms, pathways, diseases, and SNPs.

   Besides retrieving data, researchers can analyse the data in
   PCOSKB[R2] using various tools embedded in the database, such as
   Comorbidity analysis for estimating the risk of diseases co-occurring
   with PCOS; Network analysis for identifying enriched pathways and hub
   genes; and Venn analysis^[50]16 for finding common and unique genes,
   pathways and ontologies. PCOSKB[R2] will enable researchers and
   clinicians to efficiently interrogate the published data on PCOS and
   identify gaps in our current understanding of PCOS and its
   comorbidities.

Results and discussion

   PCOSKB[R2] was developed using PHP 7.2.24, MariaDB Server 10.1.44,
   JavaScript, AnyChart 8.7.1, vis.js 4.21, R version 3.6.3 and XHTML 1.0.
   It has a client-server architecture and is hosted on Apache
   web server 2.4.29 in a Linux environment.

   PCOSKB[R2] has an interactive and user-friendly interface. The homepage
   provides a short description of the database and its functionalities.
   The data is organized into datasets dedicated to (a) genes, (b) miRNAs,
   (c) SNPs, (d) diseases, (e) pathways, and (f) gene ontology terms
   associated with PCOS (Fig. [51]1a,b). These datasets can be easily
   accessed using the navigation tabs located on the top panel of the
   webpage. A brief description of these tabs is given below:
     * Search
         1. Quick search enables users to retrieve information based on
            keywords; all the information available in PCOSKB related to
            the keyword is displayed.
         2. Advanced search enables users to build specific queries for a
            gene, protein, SNP, miRNA, diseases, or pathways associated
            with PCOS.
     * Browse: This tab enables users to surf the datasets for genes,
       miRNAs, SNPs, diseases, pathways, and gene ontology terms
       associated with PCOS.
     * Tools: Algorithms for comorbidity, network, and Venn analysis can be
       accessed here.
         1. Comorbidity analysis: This tool can be used to predict
            comorbidity for selected diseases based on (i) shared genes,
            (ii) uniqueness of shared genes, (iii) shared ontologies, and
            (iv) network-based separation of shared genes (Fig. [52]1c1).
            The results for each of these modules can be downloaded as
            heatmap images (colored based on comorbidity scores) and
            spreadsheets with detailed information on shared genes and
            pairwise comorbidity scores for the selected diseases.
         2. Network analysis: The tool provides a disease-disease network
            for selected diseases, the enriched pathways in these
            diseases, and the hub and bottleneck genes that are critical
            for these diseases (Fig. [53]1c3). The results can be
            downloaded as spreadsheets or images.
         3. Venn analysis: This tool can be used to illustrate the unique
            and/or common genes, pathways, and ontologies for 2 or more
            (up to 6) diseases (Fig. [54]1c2). The analysis can be
            downloaded as Venn images or spreadsheets.
     * Help: This page provides detailed information, with examples, for
       efficiently navigating the PCOSKB interface and using the
       data-mining algorithms.
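
   The Venn analysis described in the list above amounts to partitioning
   genes by the exact combination of diseases they are associated with. The
   following is a minimal sketch of that idea, not the tool's actual
   implementation; the function name `venn_regions` and the gene lists are
   hypothetical and used for illustration only.

```python
def venn_regions(gene_sets):
    """Group genes by the exact combination of diseases they belong to.

    gene_sets: dict mapping a disease name to a set of gene symbols.
    Returns a dict mapping a frozenset of disease names to the genes found
    in exactly those diseases (one entry per non-empty Venn region).
    """
    # The text describes the tool as accepting 2 to 6 diseases.
    if not 2 <= len(gene_sets) <= 6:
        raise ValueError("Venn analysis is described for 2 to 6 diseases")
    regions = {}
    for gene in set().union(*gene_sets.values()):
        members = frozenset(d for d, genes in gene_sets.items() if gene in genes)
        regions.setdefault(members, set()).add(gene)
    return regions


# Hypothetical gene lists, for illustration only.
example = {
    "PCOS": {"INS", "LEP", "ESR1", "IL6"},
    "Diabetes": {"INS", "LEP", "PPARG"},
    "Hypertension": {"LEP", "ESR1", "PPARG"},
}
regions = venn_regions(example)
common_to_all = regions[frozenset(example)]       # genes shared by all three
unique_to_pcos = regions[frozenset({"PCOS"})]     # genes seen only in PCOS
```

   The same partitioning could be applied to pathway or ontology identifiers
   instead of gene symbols.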

Figure 1.


   Conceptual and relational view of data and tools in PCOSKB[R2].

   The applications of these datasets and algorithms for estimating the
   comorbidity risk and understanding the genetic and functional overlap
   in comorbid conditions of PCOS are demonstrated by case studies.
     * A.
       Estimation of comorbidity risk:

   Case 1: PCOS, Diabetes, and Hypertension.

   There is ample clinical evidence that women with PCOS are more likely
   to suffer from diabetes and hypertension as compared to other cardiac
   ailments^[57]17–[58]20.

   The comorbidity risk can be estimated using the ‘Comorbidity analysis’
   algorithm in PCOSKB[R2]. When diabetes mellitus and hypertensive
   diseases were analyzed for comorbidity scores together with a less
   frequently observed comorbidity, aortic diseases, the risk of
   diabetes and hypertensive diseases co-occurring with PCOS was found to
   be much higher than that of aortic diseases, in accordance with the
   clinical reports. As expected, the maximum comorbidity score amongst
   the selected diseases was between aortic diseases and hypertension
   (Fig. [59]2A). This example illustrates the utility of the comorbidity
   analysis algorithm for estimating the risk of diseases co-occurring
   with PCOS.

Figure 2.


   Network-based comorbidity analysis for PCOS and (A) diabetes and
   hypertension; (B) psychological disorders.

   Case 2: PCOS and Psychological disorders.

   Women with PCOS are known to have an increased risk (albeit at varying
   levels) of suffering from mental health conditions such as anxiety,
   depression, and schizophrenia^[62]21,[63]22. A study by Rassi et al.
   concluded that 57% of women with PCOS are diagnosed with at least one
   psychiatric disorder^[64]23. In an ambulatory population of 72
   women with PCOS, mental depression and schizophrenia were the most and
   least prevalent psychiatric disorders, respectively^[65]23. In a
   population-based retrospective study of a cohort of 5,431 women with
   PCOS and 21,724 controls, a significantly higher incidence of
   depressive and anxiety disorders was reported in women with
   PCOS^[66]24. Another study of the prevalence of psychiatric
   comorbidity reported depression as the most common disorder in women
   with PCOS, followed by anxiety^[67]25. A meta-analysis of 57 studies
   (172,040 patients) concluded that women with PCOS were most likely to
   be diagnosed with depression, followed by anxiety^[68]26.

   These clinical observations were accurately captured through the
   comorbidity scores generated using the network-based separation method.
   Mental depression had the highest comorbidity risk followed by anxiety
   disorders and schizophrenia (Fig. [69]2B). Notably, although the
   maximum number of genes (124) overlapped between PCOS and
   schizophrenia, as reflected in the edge thickness between these 2
   disease nodes, comorbidity analysis correctly estimated the lowest risk
   of comorbidity for schizophrenia amongst the three mental diseases,
   in accordance with literature reports. This highlights the predictive
   power of the network-based separation method for comorbidity analysis.
     * B.
       Identification of the genetic and functional overlap in comorbid
       conditions.

   Case 1: PCOS, Diabetes, and Hypertension.

   Although diabetes and hypertension are commonly observed comorbid
   conditions in women with PCOS, not much is known about the genetic
   overlap of these disorders^[70]27.

   Venn analysis revealed that 32 genes and 364 pathways are commonly
   associated with PCOS, diabetes, and hypertension (Supplementary Table
   [71]S1). Network analysis identified 104 enriched pathways, 21 hub
   genes, and 10 bottleneck genes for these diseases (Supplementary Figs.
   [72]S1a1 and [73]S1a2, Supplementary Table [74]S1). Hub genes, due to
   their high degree of inter-cluster connectivity, play an important role
   in the crosstalk of enriched pathways. We mined literature for
   ascertaining the association of these 21 genes with the comorbid
   conditions of diabetes, hypertension, and PCOS. Of the 21 genes, we
   found literature evidence for association of four genes (ESR1, PTGS2,
   LEP, PPARG) with these comorbidities, as detailed below.
     * (i)
       ESR1 codes for estrogen receptor alpha and hence ESR1 mutations can
       increase the risk of estrogen-dependent pathophysiologies. In a
       study by Zhao L et al., ESR1 polymorphisms were reported to be
       associated with hypertension and diabetes^[75]28. A case–control
       study by Jiao X et al. documented that altered expression of ESR1
       can influence the risk of PCOS and its upregulation may contribute
       to abnormal follicular development^[76]29,[77]30.
     * (ii)
       Prostaglandin-endoperoxide synthase (PTGS2) is a key enzyme for
       biosynthesis of the inflammatory hormone prostaglandin. It is known
       to be upregulated in granulosa cells of women with PCOS and
       arteries of patients with hypertension and diabetes^[78]31,[79]32.
     * (iii)
       Leptin hormone encoded by the leptin gene (LEP) plays an important
       role in the regulation of energy homeostasis and body weight
       management. Several independent studies have reported the
       association of leptin receptor deficiency with diabetes,
       hypertension, and PCOS. High circulatory leptin has been observed
       in patients with a cluster of metabolic disorders including
       hypertension, diabetes^[80]33, and PCOS^[81]69.
     * (iv)
       Peroxisome proliferator-activated receptor gamma (PPARG) regulates
       adipocyte differentiation and thereby controls beta-oxidation of
       fatty acids. Mutations in PPARG are known to increase the risk for
       development of hypertension and diabetes^[82]34.

   In addition to the identification of hub and bottleneck genes, the View
   interaction option in the Gene network analysis tool can be used to
   display the tissue-specific interacting partners of each gene in the
   network (Supplementary Fig. [83]S1). Using this feature, we identified
   two genes (PON1, ADIPOQ) that interact with multiple hub genes
   (Supplementary Figs. [84]S1a3 and [85]S1a4). PON1 interacts with six
   hub genes (TNF, IL6, INS, CCL2, LEP, PPARG) and one bottleneck gene
   (LIPC) (Supplementary Fig. [86]S1a4). Adiponectin (ADIPOQ) interacts
   with 19 hub genes that are expressed in adipose tissue (Supplementary
   Fig. [87]S1a3). The association of both these genes in the comorbid
   conditions of type 2 diabetes, hypertension, and PCOS is documented in
   the literature. Paraoxonase-1 (PON1) mediates enzymatic protection of
   low-density lipoprotein (LDL) against oxidative modifications and is
   known to be associated with diabetes, hypertension, and
   PCOS^[88]35,[89]36. Low levels of adiponectin are associated with
   several obesity-related disorders^[90]37 and ADIPOQ is a biomarker for
   type-2 diabetes, hypertension^[91]38, and PCOS^[92]39.

   This case study illustrates the utility of the Gene network analysis
   tool in deciphering the genetic and functional overlap of comorbid
   conditions. While the role of all the identified hub genes in PCOS,
   diabetes, and hypertension individually has been well established, it
   would be worthwhile to establish the role of these hub genes in the
   pathophysiology of PCOS, diabetes, and hypertension, as a combined
   disease state, and explore them as polypharmacological drug targets.

   Case 2: PCOS and Psychological disorders—anxiety and mental depression.

   Insulin resistance, obesity, and altered levels of androgens
   (Supplementary Table [93]S2) have been reported as the common
   pathophysiological link between PCOS and psychiatric
   disorders^[94]24,[95]40. Interestingly, evaluation of enriched pathways
   for the top two psychological disorders (mental depression and anxiety)
   that are comorbid with PCOS revealed pathways that represent these
   cellular mechanisms (Supplementary Table [96]S2, Supplementary Figs.
   [97]S1b1 and [98]S1b2, Supplementary Table [99]S1).

   Network analysis of the enriched pathways revealed 21 hub genes and 10
   bottleneck genes. Of these, the role of two hub genes (IL6, STAT3) in
   the comorbidity of PCOS and selected psychiatric disorders has been
   reported in literature. Kawamura S et al. reported elevated levels of
   the inflammatory cytokine IL6 in women suffering from PCOS and
   depression^[100]41. A negative association of STAT3 with anxiety and
   depression has been reported by Feng and Shao in PCOS-induced rat
   models^[101]42. Anxiety and depression in rats were analysed based on
   their decreased locomotor activity in behavioural tests such as
   open-field tests, object recognition tests, and elevated plus maze
   tests.

   Case 3: PCOS and Pregnancy-related disorders—preeclampsia.

   Women with PCOS are known to be at higher risk of pregnancy-related
   disorders as compared to women without PCOS^[102]43,[103]44. In PCOSKB,
   genes and miRNAs associated with pregnancy-related disease terms like
   “Pregnancy complications, Cardiovascular”, “Pregnancy associated
   hypertension”, “Ectopic pregnancy”, “Gestational diabetes”, and
   “Preeclampsia” can be accessed under the disease category of
   reproductive disorders.

   miRNAs are known to play a critical role in the pathogenesis of PCOS
   and pregnancy-related disorders^[104]45–[105]47. Pathways such as
   adipocytokine signaling, oxytocin signaling, TNF signaling,
   progesterone-mediated oocyte maturation, estrogen signaling, MAPK, and
   FoxO signaling are known to be regulated by miRNAs and associated with
   pregnancy outcome^[106]48,[107]49.

   miRNA-based pathway enrichment analysis of preeclampsia revealed 88
   enriched pathways that included progesterone-mediated oocyte
   maturation, estrogen signaling, MAPK signaling, and FoxO signaling
   pathways (Supplementary Table [108]S1); these pathways are known to be
   associated with PCOS and preeclampsia in literature^[109]49–[110]51.

Conclusion and future directions

   The aim of developing PCOSKB[R2] was to provide the community of
   clinicians and researchers with a one-stop online portal for accessing
   manually curated information on PCOS. The genes listed in the
   manually curated dataset of PCOSKB[R2] were identified based on
   inferences and data mined from publications. Relevant annotations of
   these genes such as gene interactions, pathway associations, and SNPs
   have been provided along with links to the reference literature.

   This second release of PCOSKB offers substantial advancement both in
   terms of data and analysis tools^[111]13. In addition to the advanced
   search and browse features for efficiently interrogating the database,
   users can use the tools to predict comorbidity risks, enriched pathways,
   and hub genes for selected diseases. These tools are powerful for
   gaining insights into the comorbidities of PCOS and the underlying
   gene-pathway associations, as shown by the aforementioned case
   studies. However, users need to be aware of publication
   or literature bias, which can lead to erroneous inferences.

   The impact of publication bias on the results of the comorbidity
   analysis tool can be assessed by the following example. Women with PCOS
   are known to suffer from an increased risk of endometrial cancer
   followed by ovarian cancer as compared to women without PCOS^[112]50.
   The incidence of breast cancer is similar in women with and without
   PCOS^[113]41,[114]50,[115]51. The comorbidity analysis tool, using the
   method of shared genes, incorrectly predicted the highest risk of
   comorbidity for breast cancer, followed by ovarian cancer, and the
   lowest for endometrial cancer (Fig. [116]3). This error arises from the
   positive publication bias for breast cancer (407,285 PubMed records) as
   compared to ovarian (116,514 PubMed records) and endometrial cancers
   (37,950 PubMed records). Hence, far fewer genes are known to be
   associated with endometrial cancer (38 genes) than with
   ovarian (57 genes) and breast cancers (129 genes).

Figure 3.


   Comorbidity analysis for PCOS and cancers using (a) shared genes and
   (b) network-separation methods.

   The network separation-based algorithm identified the highest
   comorbidity risk for ovarian cancer, followed by breast and endometrial
   cancers (Fig. [119]3). The network separation method is based on the
   distance/separation of the disease-causing genes in pathway networks
   and is therefore more robust and less dependent (though not independent)
   on the number of disease-causing genes than the algorithm based on
   shared genes. This algorithm should, therefore, be the method of choice
   for comorbidity prediction when a small number of diseases with a
   possibility of publication bias is analysed.

   The incidence of PCOS is rising globally^[120]52–[121]56 and we expect
   the data generated on PCOS to increase exponentially in the years to
   come. Depending on the availability and nature of data generated from
   these research efforts, PCOSKB[R2] will be updated with new information
   and analysis tools. Hopefully, with more data, the negative impact of
   publication bias will be reduced. PCOSKB[R2] will be a comprehensive
   source of updated and curated information on gene-disease-pathway
   associations in PCOS and its comorbidities.

Methods

Dataset curation

Curation of the gene dataset

   The genes associated with PCOS were identified by querying
   PubMed^[122]57 with MeSH (Medical Subject Headings)^[123]58 terms such
   as “Ovary Syndrome, Polycystic”, “Syndrome, Polycystic Ovary”,
   “Stein-Leventhal Syndrome”, “Stein Leventhal Syndrome”, “Syndrome,
   Stein-Leventhal”, “Sclerocystic Ovarian Degeneration”, “Ovarian
   Degeneration, Sclerocystic”, “Sclerocystic Ovary Syndrome”, “Polycystic
   Ovarian Syndrome”, “Ovarian Syndrome, Polycystic”, “Polycystic Ovary
   Syndrome 1”, “Sclerocystic Ovaries”, “Ovary, Sclerocystic”,
   “Sclerocystic Ovary”, “PCOS” and “Gene”. Using this query, 1561
   literature records were retrieved from PubMed.

   The association of 533 genes with PCOS was manually confirmed by
   critically reviewing the 1561 publications. A gene was verified to be
   PCOS-associated if the literature reported experimental evidence based
   on RT-PCR, western blotting, immunochemistry, and cell-based assays.
   Additional annotations such as the nature of the study population,
   ethnicity, mutations/SNPs, unique identifiers for gene and protein
   records, protein structures, family and ontology details, and metabolic
   pathway information were obtained from literature and by mapping the gene
   records to databases such as Gene^[124]59, dbSNP^[125]60,
   Ensembl^[126]61, UniProt^[127]62, PDB^[128]63, GO^[129]64,
   KEGG^[130]65, OMIM^[131]66, Reactome^[132]67 and STRING^[133]68
   (Supplementary Table [134]S3).

Curation of the gene-disease association dataset

   Disease associations of the PCOS genes were retrieved from
   DisGeNET^[135]69 and PubMed^[136]57 databases. The disease terms in
   DisGeNET that are linked to PubMed literature and have an active
   MedGen^[137]70 ConceptID (CUI) were retained for further curation. The
   terms with disease type as “phenotype” and disease semantic type as
   “finding”, “pathologic function”, “sign or symptom”, “injury or
   poisoning”, “experimental model of disease”, “experimental model of
   disease; Neoplastic process”, “anatomical abnormality”, “organism
   attribute” were discarded from the list as the terms under these
   headers did not refer to diseases.
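
   The filtering rules above can be sketched as a simple predicate over
   disease-term records. This is an illustrative sketch only: the record
   fields (`disease_type`, `semantic_type`, `pmids`, `cui_active`), the
   example records and the placeholder PubMed identifiers are hypothetical,
   and the sketch assumes that a term is discarded if either its disease
   type is “phenotype” or its semantic type is in the excluded list.

```python
# Semantic types named in the text as not referring to diseases.
EXCLUDED_SEMANTIC_TYPES = {
    "finding",
    "pathologic function",
    "sign or symptom",
    "injury or poisoning",
    "experimental model of disease",
    "experimental model of disease; neoplastic process",
    "anatomical abnormality",
    "organism attribute",
}


def is_disease_term(record):
    """Return True if a disease-term record passes the curation filters."""
    # Keep only terms linked to literature and with an active concept ID.
    if not record["pmids"] or not record["cui_active"]:
        return False
    # Assumption: either condition below is enough to discard a term.
    if record["disease_type"].lower() == "phenotype":
        return False
    return record["semantic_type"].lower() not in EXCLUDED_SEMANTIC_TYPES


# Hypothetical records with placeholder PubMed identifiers.
records = [
    {"name": "Diabetes Mellitus", "disease_type": "disease",
     "semantic_type": "Disease or Syndrome", "pmids": ["0000001"], "cui_active": True},
    {"name": "Weight Gain", "disease_type": "phenotype",
     "semantic_type": "Finding", "pmids": ["0000002"], "cui_active": True},
    {"name": "Term without literature", "disease_type": "disease",
     "semantic_type": "Disease or Syndrome", "pmids": [], "cui_active": True},
]
kept = [r["name"] for r in records if is_disease_term(r)]
```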

   This list was further subdivided into two sets based on the source of
   information in DisGeNET^[138]69. Dataset ‘A’ comprised gene-disease
   associations collated in DisGeNET from manually curated databases such
   as ClinVar^[139]71, CTD^[140]72, Genomics England^[141]73, GWAS
   Catalog^[142]74 and GWAS^[143]75, while dataset ‘B’ had information
   collated from text mining datasets such as BEFREE^[144]76 and
   LHGDN^[145]77. Since dataset ‘A’ records were from curated sources,
   these were included in PCOSKB[R2] without further verification. For
   dataset ‘B’, gene-disease associations were validated based on rigorous
   manual curation. The associated literature was reviewed carefully and
   evidence for gene-disease association was sourced from experimental
   techniques involving human samples, such as RT-PCR, western blotting,
   immunochemistry, and cell-based assays. Genes that did not have any
   disease information in DisGeNET were queried in PubMed and publication
   records were mined using pubmed.mineR package^[146]78.

   In cases where multiple disease terms referred to the same disease,
   the terms were retitled as explained in Table [147]1.

Table 1.

   Rules for redundancy elimination in gene-disease association dataset.
   S. No | Type of redundancy | Example disease terms | Modified term
   1 | Target organ of disease | ‘Malignant neoplasm of ovary’, ‘ovarian neoplasm’, ‘Epithelial ovarian cancer’ | Ovarian cancer
   2 | Age of onset of disease | ‘Adult type dermatomyositis’, ‘Dermatomyositis, Childhood Type’, ‘Dermatomyositis’ | Dermatomyositis
   3 | Synonyms of disease | ‘Mental Depression’, ‘Major Depressive Disorder’, ‘Depressive disorder’ | Mental Depression
   4 | Severity of disease | ‘Mental disorder’, ‘Mental disorder, severe’, ‘Mental disorder, acute’, ‘mental disorder, chronic’ | Mental disorder
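
   The retitling rules in Table 1 can be thought of as a lookup from
   redundant disease terms to a single modified term, after which the
   gene lists of merged terms are pooled. The sketch below illustrates
   this with the examples from Table 1; the function names and the gene
   labels (`GENE_A` etc.) are hypothetical, and the actual curation was
   manual rather than automated in this way.

```python
# Redundant term (lower-cased) -> modified term, taken from Table 1.
RETITLED = {
    "malignant neoplasm of ovary": "Ovarian cancer",
    "ovarian neoplasm": "Ovarian cancer",
    "epithelial ovarian cancer": "Ovarian cancer",
    "adult type dermatomyositis": "Dermatomyositis",
    "dermatomyositis, childhood type": "Dermatomyositis",
    "major depressive disorder": "Mental Depression",
    "depressive disorder": "Mental Depression",
    "mental disorder, severe": "Mental disorder",
    "mental disorder, acute": "Mental disorder",
    "mental disorder, chronic": "Mental disorder",
}


def retitle(term):
    """Map a disease term to its modified term; unknown terms pass through."""
    cleaned = term.strip()
    return RETITLED.get(cleaned.lower(), cleaned)


def merge_associations(pairs):
    """Pool genes of (gene, disease term) pairs under the modified terms."""
    merged = {}
    for gene, term in pairs:
        merged.setdefault(retitle(term), set()).add(gene)
    return merged


# Hypothetical gene-disease pairs, for illustration only.
pairs = [
    ("GENE_A", "Malignant neoplasm of ovary"),
    ("GENE_B", "ovarian neoplasm"),
    ("GENE_A", "Epithelial ovarian cancer"),
    ("GENE_C", "Dermatomyositis"),
]
merged = merge_associations(pairs)
```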

Unique categorization of disease groups

   Many of the disease terms in DisGeNET^[149]69 are mapped to multiple
   MeSH^[150]58 headings. E.g. ovarian neoplasm is linked to neoplasms and
   reproductive disorders. An empirical rule-based method based on
   ICD-11^[151]79 classification (Fig. [152]4) was adopted to uniquely
   categorize the disease terms at the parent level.

Figure 4.


   ICD-11 based rules for non-redundant categorization of disease terms.
   Ovals represent retitled parent disease terms.

   For complete documentation of merged terms refer to Supplementary Table
   [155]S1.

Tools

Comorbidity analysis

   For a pair of diseases ($D_i$, $D_j$), the list of PCOS-associated genes
   was retrieved from the
   gene-disease dataset of PCOSKB[R2] (see “[156]Curation of the
   gene-disease association dataset” section). Four different algorithms
   have been used to predict the risk of comorbidity in women with PCOS.
   The comorbidity scores are illustrated as dynamic heat maps created
   using AnyChart JS^[157]80 package.

Based on shared genes

   This method is based on the principle that disease relationships are
   dependent on their shared genes^[158]81. A score to predict the risk of
   diseases $D_i$ and $D_j$ to co-occur is calculated using the below equation

   $$\mathrm{Comorbidity}_{\mathit{sharedgenes}}(D_i, D_j) = \left[ \frac{\left( G_{D_i} \cap G_{D_j} \right)}{\min\left( G_{D_i}, G_{D_j} \right)} \right] \times 100$$

   where $G_{D_i}$ and $G_{D_j}$ are PCOS genes associated with diseases
   $D_i$ and $D_j$.

   The score is directly proportional to the number of shared genes; hence
   a higher score indicates a higher risk of comorbidity.
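
   As a worked illustration of the shared-genes score, the sketch below
   reads the numerator as the number of shared genes and the denominator
   as the size of the smaller gene list; this reading, the function name
   and the gene lists are assumptions made for illustration.

```python
def shared_gene_score(genes_i, genes_j):
    """Shared genes as a percentage of the smaller of the two gene lists."""
    genes_i, genes_j = set(genes_i), set(genes_j)
    smaller = min(len(genes_i), len(genes_j))
    if smaller == 0:
        return 0.0  # assumption: a disease without genes yields a score of 0
    return len(genes_i & genes_j) / smaller * 100


# Hypothetical PCOS gene lists for two diseases, for illustration only.
diabetes = {"INS", "LEP", "PPARG", "IL6"}
hypertension = {"LEP", "PPARG", "ESR1"}
score = shared_gene_score(diabetes, hypertension)  # 2 shared genes out of 3
```

   Because the denominator is the smaller gene list, a disease with few
   known genes can reach a high score with only a handful of shared genes,
   which is one reason the size of each gene list matters for this method.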

Based on the uniqueness of shared genes

   This method is based on the observation that diseases, whose genes are
   not associated with multiple diseases, have a higher comorbidity risk
   as compared to diseases caused by genes associated with multiple
   diseases^[159]82.

   The uniqueness of the $i$th gene ‘$g_i$’ associated with diseases
   $D_i$, $D_j$ is calculated as:

   $$\mathrm{Uniqueness}(g_i) = \left[ 1 - \sqrt{\frac{D_{g_i}}{D^{T}}} \right]$$

   where $D^{T}$ represents the total number of diseases in the
   gene-disease dataset and $D_{g_i}$ is the number of diseases associated
   with the $i$th gene.

   If $n\ \mathit{genes} \in D_i \cap D_j$, then the comorbidity of each
   disease pair is calculated as follows:

   $$\mathrm{Comorbidity}_{\mathit{uniqueness}}(D_i, D_j) = \sum_{n=1}^{n} \left[ \mathrm{Uniqueness}(g_i) \right]_n$$

   The score is directly proportional to the number of uniquely shared
   genes, hence a higher score indicates a higher risk of comorbidity for
   the pair of diseases.
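
   The uniqueness-based score described above can be sketched as a sum of
   per-gene uniqueness values over the genes shared by the two diseases.
   The data structure (`gene_to_diseases`), the function names and the
   toy data are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def uniqueness(n_diseases_for_gene, n_diseases_total):
    """1 - sqrt(D_g / D_T): genes linked to fewer diseases score higher."""
    return 1 - math.sqrt(n_diseases_for_gene / n_diseases_total)


def uniqueness_score(disease_i, disease_j, gene_to_diseases, n_diseases_total):
    """Sum the uniqueness of every gene associated with both diseases."""
    return sum(
        uniqueness(len(diseases), n_diseases_total)
        for diseases in gene_to_diseases.values()
        if disease_i in diseases and disease_j in diseases
    )


# Toy gene-disease dataset with 4 diseases in total (hypothetical).
gene_to_diseases = {
    "G1": {"A", "B"},            # shared by A and B, linked to 2 diseases
    "G2": {"A", "B", "C", "D"},  # shared by A and B, linked to all 4 diseases
    "G3": {"A"},                 # not shared
}
score = uniqueness_score("A", "B", gene_to_diseases, n_diseases_total=4)
```

   In this toy case G2 contributes nothing because it is associated with
   every disease, so the score is driven entirely by G1.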

Based on the biological process and molecular function of associated genes

   This algorithm is based on the inference that 95% of disease links can
   be predicted by the functional overlap of the associated genes^[160]81.
   Disease pair comorbidity risk is calculated and scored as per the
   standard Jaccard index^[161]83.

   $$\mathrm{Comorbidity}_{\mathit{ontology}}(D_i, D_j) = \left[ \left| \frac{GO_i \cap GO_j}{GO_i \cup GO_j} \right| \right] \times 100$$

   where $GO_i$ and $GO_j$
   are the set of distinct molecular functions and biological processes
   for genes of diseases i and j respectively as retrieved from Gene
   Ontology (GO) database.

   The score is directly proportional to the functional overlap of
   disease-associated genes; therefore, a higher score indicates a higher
   risk of comorbidity for the pair of diseases.
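
   The Jaccard-based ontology score can be illustrated with plain set
   operations. The function name and the ontology labels below are
   hypothetical placeholders rather than real GO identifiers.

```python
def ontology_score(go_i, go_j):
    """Jaccard index of two sets of ontology terms, as a percentage."""
    go_i, go_j = set(go_i), set(go_j)
    union = go_i | go_j
    if not union:
        return 0.0  # assumption: no annotations on either side gives 0
    return len(go_i & go_j) / len(union) * 100


# Placeholder ontology labels for two diseases, for illustration only.
terms_i = {"term_1", "term_2", "term_3"}
terms_j = {"term_1", "term_3", "term_4", "term_5"}
score = ontology_score(terms_i, terms_j)  # 2 shared terms out of 5 distinct
```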

Based on network separation of disease genes in the human interactome

   Diseases whose genes are located closer in the human interactome have a
   higher probability of co-occurrence as compared to diseases with genes
   spread apart in the network^[162]84. Experimentally validated human
   protein–protein interactions from STRING v11^[163]68 were used for the
   algorithm. The comorbidity score is calculated as:
   $$ \mathrm{Comorbidity}_{Shortest\ path}(D_i, D_j) = D_{ij} - \frac{D_{ii} + D_{jj}}{2} $$

   where $D_{ii}$ and $D_{jj}$ are the averages of the minimum distances
   among the genes associated with disease i and with disease j,
   respectively, and $D_{ij}$ is the average of the minimum distances
   between genes of diseases i and j.

   Since the score represents the network-based separation of
   disease-associated genes, a lower score indicates a higher risk of
   comorbidity for the pair of diseases.
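
   The separation score can be sketched as follows. This is an
   illustration under stated assumptions, not the PCOSKB implementation:
   the interactome is treated as an unweighted, undirected graph,
   distances are hop counts from a breadth-first search, a gene's
   within-disease distance excludes the gene itself, a gene shared by
   both diseases contributes a distance of 0 to the between-disease
   term, and genes with no reachable partner are ignored. All function
   and protein names are hypothetical.

```python
from collections import deque
from statistics import mean


def build_graph(edges):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, source):
    """Hop counts from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def min_distances(graph, sources, targets, skip_self):
    """For each source gene, the distance to its closest target gene."""
    minima = []
    for s in sources:
        dist = bfs_distances(graph, s)
        hops = [dist[t] for t in targets
                if t in dist and not (skip_self and t == s)]
        if hops:  # assumption: genes with no reachable target are ignored
            minima.append(min(hops))
    return minima


def comorbidity_shortest_path(graph, genes_i, genes_j):
    """D_ij - (D_ii + D_jj) / 2; lower values mean closer disease modules."""
    # Note: mean() raises StatisticsError if a disease has no usable gene pair.
    d_ii = mean(min_distances(graph, genes_i, genes_i, skip_self=True))
    d_jj = mean(min_distances(graph, genes_j, genes_j, skip_self=True))
    d_ij = mean(min_distances(graph, genes_i, genes_j, skip_self=False)
                + min_distances(graph, genes_j, genes_i, skip_self=False))
    return d_ij - (d_ii + d_jj) / 2


# Toy chain P1 - P2 - P3 - P4 - P5 (placeholder protein names).
graph = build_graph([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")])
far_apart = comorbidity_shortest_path(graph, {"P1", "P2"}, {"P4", "P5"})
overlapping = comorbidity_shortest_path(graph, {"P1", "P2"}, {"P2", "P3"})
```

   In this toy chain the overlapping gene sets receive a lower (negative)
   score than the well-separated ones, matching the interpretation that
   lower scores indicate closer disease modules.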

Network analysis

   This tool can be used for visualization of disease networks,
   identification of enriched pathways, and prioritization of disease
   genes. The Vis.js^[164]85 visualization library was used for dynamic
   network creation and visualization. The tool has three modules, as
   described below.

Disease-disease network

   A dynamic subset of the human disease network^[165]86 can be created
   for a selected group of diseases. Diseases are represented as nodes and
   the size of a node is proportional to the number of genes or miRNAs
   associated with the disease. Disease nodes are connected by edges based
   on the number of shared genes or miRNAs between them. Users can select
   multiple diseases for the identification of enriched pathways in these
   diseases.
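
   One way to prepare such a network is sketched below, assuming a
   mapping from disease names to gene sets. The function name and the
   placeholder diseases and genes are hypothetical. The `id`, `label`,
   `value`, `from` and `to` keys follow the node and edge format commonly
   used with vis-network, where `value` drives size scaling; whether
   PCOSKB uses these exact fields, and whether an edge is drawn for a
   single shared gene, are assumptions.

```python
from itertools import combinations


def disease_network(disease_genes):
    """Return (nodes, edges) for a disease-disease network.

    Node size is the number of associated genes; an edge is added when two
    diseases share at least one gene, weighted by the number shared.
    """
    nodes = [{"id": name, "label": name, "value": len(genes)}
             for name, genes in disease_genes.items()]
    edges = []
    for a, b in combinations(sorted(disease_genes), 2):
        shared = disease_genes[a] & disease_genes[b]
        if shared:
            edges.append({"from": a, "to": b, "value": len(shared)})
    return nodes, edges


# Placeholder diseases and genes for illustration only.
example = {
    "Disease A": {"G1", "G2", "G3"},
    "Disease B": {"G2", "G3"},
    "Disease C": {"G9"},
}
nodes, edges = disease_network(example)
```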

Pathway enrichment analysis

   The disease-pathway associations are inferred by mapping
   disease-associated genes and target genes of associated miRNAs to their
   pathways^[166]87. Enriched pathways are identified based on the
   hypergeometric distribution, with the p value threshold set at 0.05
   (gene dataset) and 0.001 (miRNA dataset) based on the data size. Users
   can select pathways and visualize the network. Each pathway is
   represented as a node and is connected to other pathways in the network
   based on common genes or miRNAs. The thickness of the edge is
   proportional to the number of shared genes or miRNAs. If the gene
   dataset is selected, the enriched pathways can be examined for the
   identification of critical hub and bottleneck genes through the Gene
   network analysis module.
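
   A hypergeometric enrichment test of this kind can be sketched as
   follows. The upper-tail formulation, the parameter names and the toy
   numbers are assumptions made for illustration; the background gene
   set used by PCOSKB is not specified here.

```python
from math import comb


def hypergeom_enrichment_p(population, in_pathway, selected, overlap):
    """P(X >= overlap) when drawing `selected` genes without replacement
    from `population` genes, of which `in_pathway` belong to the pathway."""
    upper = min(in_pathway, selected)
    total = comb(population, selected)
    tail = sum(comb(in_pathway, x) * comb(population - in_pathway, selected - x)
               for x in range(overlap, upper + 1))
    return tail / total


def is_enriched(p_value, dataset="gene"):
    """Apply the thresholds quoted in the text: 0.05 (gene), 0.001 (miRNA)."""
    threshold = 0.05 if dataset == "gene" else 0.001
    return p_value < threshold


# Toy numbers: 10 background genes, 4 in the pathway, 3 disease genes,
# 2 of which fall in the pathway.
p = hypergeom_enrichment_p(population=10, in_pathway=4, selected=3, overlap=2)
```

   For these toy numbers the p value is 1/3, so the pathway would not be
   called enriched under either threshold.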

Gene network analysis

   Experimentally validated interactions from STRING v11^[167]68 were used
   for creating gene interaction networks for enriched pathways. Critical
   genes in these pathways were identified based on network topological
   properties such as degree, closeness centrality, and betweenness
   centrality, calculated using the graph package in R^[168]88. The hub
   and bottleneck genes were defined based on the study of Rakshit et
   al.^[169]89.

   Hub genes: Degree > (Mean of Degree + (2 × Standard Deviation)) OR
   Closeness centrality > (Mean of Closeness centrality + (2 × Standard
   Deviation)).

   Bottleneck genes: Degree < (Mean of Degree) AND Betweenness
   centrality > (Mean of Betweenness centrality).
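
   The two rules can be applied to precomputed centrality values as in
   the sketch below. It assumes the metrics are already available per
   gene and uses the sample standard deviation; the text does not state
   which estimator was used, so that choice, the function name, and the
   toy values are assumptions.

```python
from statistics import mean, stdev


def classify_genes(metrics):
    """Split genes into hubs and bottlenecks.

    `metrics` maps gene -> dict with 'degree', 'closeness', 'betweenness'.
    Requires at least two genes so that a standard deviation exists.
    """
    degree = [m["degree"] for m in metrics.values()]
    closeness = [m["closeness"] for m in metrics.values()]
    betweenness = [m["betweenness"] for m in metrics.values()]

    degree_mean = mean(degree)
    degree_cut = degree_mean + 2 * stdev(degree)
    closeness_cut = mean(closeness) + 2 * stdev(closeness)
    betweenness_mean = mean(betweenness)

    hubs = sorted(g for g, m in metrics.items()
                  if m["degree"] > degree_cut or m["closeness"] > closeness_cut)
    bottlenecks = sorted(g for g, m in metrics.items()
                         if m["degree"] < degree_mean
                         and m["betweenness"] > betweenness_mean)
    return hubs, bottlenecks


# Toy network metrics with placeholder gene names.
toy = {f"G{i}": {"degree": 3, "closeness": 0.5, "betweenness": 0.01}
       for i in range(2, 10)}
toy["G1"] = {"degree": 20, "closeness": 0.9, "betweenness": 0.5}
toy["G10"] = {"degree": 2, "closeness": 0.5, "betweenness": 0.4}
hubs, bottlenecks = classify_genes(toy)
```

   In the toy data, G1 is highly connected and is flagged as a hub, while
   G10 has a below-average degree but above-average betweenness and is
   flagged as a bottleneck.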

Venn analysis

   Lists of common and unique genes, pathways, and ontologies can be
   identified for a selected list of diseases using this tool. The jvenn
   source code^16 was used to develop the interactive 6-way Venn diagram.
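
   The common and unique lists behind such a diagram can be derived with
   plain set operations. The sketch below is not the jvenn code; it
   assumes each disease maps to a set of items, and the function name and
   placeholder values are hypothetical.

```python
from collections import defaultdict


def venn_regions(named_sets):
    """Map each exclusive Venn region (a frozenset of set names) to the
    items that belong to exactly those sets and no others."""
    membership = defaultdict(set)
    for name, items in named_sets.items():
        for item in items:
            membership[item].add(name)
    regions = defaultdict(set)
    for item, names in membership.items():
        regions[frozenset(names)].add(item)
    return dict(regions)


# Placeholder diseases and genes for illustration only.
regions = venn_regions({
    "Disease A": {"G1", "G2", "G3"},
    "Disease B": {"G2", "G3", "G4"},
    "Disease C": {"G3", "G5"},
})
common_to_all = regions[frozenset({"Disease A", "Disease B", "Disease C"})]
unique_to_a = regions[frozenset({"Disease A"})]
```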

Supplementary information

   [170]Supplementary Information 1. (834.6KB, docx)
   [171]Supplementary Information 2. (989.6KB, xlsx)

Acknowledgements