Graphical abstract [31]graphic file with name ga1.jpg [32]Open in a new tab Abbreviations: PCOS, Polycystic ovary syndrome; DEGs, Differentially expressed genes; GEO, Gene Expression Omnibus Keywords: PCOS, Differential gene expression, Pathway analysis, Enrichment analysis, Comorbidity analysis Abstract Polycystic ovary syndrome (PCOS) is a complex multigenic disorder and women with PCOS suffer from several comorbidities. Although, obesity is a known risk factor for PCOS, the incidence of lean women with PCOS is on the rise. A systematic and comparative study on lean and obese PCOS with respect to genes, pathways and comorbidity analysis has not been attempted so far. Analysis of differentially expressed genes (DEGs) across tissue types for lean and obese PCOS revealed that the majority of them were downregulated for lean and obese PCOS. Ovarian and endometrial tissues shared several commonly dysregulated genes, suggesting shared PCOS pathophysiology mechanisms exist across tissues. Several pathways for cellular homeostasis, such as inflammation and immune response, insulin signaling, steroidogenesis, hormonal and metabolic signaling, regulation of gonadotrophic hormone secretion, cell structure and signaling that are known to be affected in PCOS were found to be enriched in our gene expression analysis of lean and obese PCOS. The gene-disease network is denser for obese PCOS with a higher comorbidity score as compared to lean PCOS. 1. Introduction Polycystic ovary syndrome (PCOS) is the most common endocrinological and metabolic disorder reported in women of reproductive age. The cause of the disease can be attributed to genetic and lifestyle factors [33][1]. The underlying pathophysiology of PCOS, based on our current understanding, can be mainly attributed to elevated LH (Luteinizing Hormone)/FSH (Follicle Stimulating Hormone) ratio and/or insulin [34][2]. The diagnosis of PCOS is essentially based on three features which include the presence of hyperandrogenism, menstrual irregularity and polycystic ovaries [35][3]. While obesity is a known risk factor for PCOS, not all women with obesity develop PCOS and not all women with PCOS are obese [36][4], [37][5]. Around 30–70% of women, belonging to diverse ethnicities, are affected by PCOS and obesity [38][6]. On the other hand, 20–50% of women with PCOS are normal weight/lean and the pathophysiology may vary in these two phenotypes [39][7]. Metabolic syndrome, which is a constellation of conditions such as hypertension, abdominal obesity, insulin resistance and hypercholesterolemia, is commonly seen in women with obesity and PCOS [40][8], [41][9]. Dyslipidemia and insulin resistance are more pronounced in obese PCOS as compared to lean PCOS; suggestive of dissimilar metabolic profiles in these phenotypes [42][4], [43][10], [44][11], [45][12]. For the same reason, the incidence of acanthosis nigricans and impaired lipid profiles and glucose tolerance, which are indicators of insulin resistance, are more widespread in obese PCOS [46][13]. Altered secretions of adipokines such as adiponectin (ADIPOQ), leptin (LEP) and resistin (RETN) by adipose tissues is one of the important contributory factors to insulin resistance, cardiovascular diseases and metabolic disorders [47][14], [48][15]. ADIPOQ is downregulated while LEP and RETN are upregulated in obese conditions [49][14], [50][15]. Accordingly, levels of ADIPOQ have been found to be lower in obese PCOS as compared to lean PCOS, and levels of LEP gene have been reported to be lower in lean PCOS as compared to obese PCOS [51][15], [52][16]. Levels of RETN were found to be similarly upregulated in obese and lean PCOS cases as compared to controls [53][14]. Although PCOS and obesity are characterized by increased androgen production, the bioavailable androgen levels are normal in obese non-PCOS cases as compared to PCOS, due to its high clearance rate [54][5]. Sex hormone binding globulin (SHBG) plays a major role in metabolic clearance of free androgens and other hydroxysteroid ligands to the target tissues and liver. Lower serum levels of SHBG in PCOS leads to elevated levels of circulating androgens [55][17], [56][18]. The androgen levels are elevated similarly in lean and obese PCOS cases [57][19]. With respect to hormonal profiles of lean and obese PCOS phenotypes, levels of LH, FSH, LH to FSH ratio, free testosterone, dehydroepiandrosterone (DHEA), anti-müllerian hormone (AMH), estradiol and progesterone are similar in both the phenotypes [58][19], [59][20]. The factors associated with PCOS such as anovulation, insulin resistance and altered steroidogenesis are known to increase the risk of cancers in females with PCOS [60][21], [61][22], [62][23]. Amongst the reproductive cancers, clinical studies have reported that women suffering from PCOS have a higher risk of suffering from endometrial cancer [63][24] followed by ovarian cancer [64][22]. The mortality rate of ovarian cancer for women who are suffering from obesity and PCOS women is higher as compared to lean women [65][25]. Although few studies have suggested that the obesity and anovulation in PCOS women can increase the risk of breast cancer [66][21], [67][22], the association of breast cancer and PCOS is undecided [68][26], [69][27]. The information currently available for lean PCOS is scarce as most of the reported literature is based on patients managed in hospital or fertility clinics, which is known to better represent obese PCOS [70][19]. Probably for the same reason, there are inconsistencies in the observations from the genetic studies [71][28]. There is a need to systematically study and compare the gene expression profiles of lean and obese PCOS to gain a more complete understanding of the syndrome. Here, we identify differentially expressed genes, enriched pathways and associated comorbidities for lean and obese PCOS, based on systematically reviewed and analyzed lean and obese PCOS data from the Gene Expression Omnibus (GEO) [72][29]. We used a meta-analysis approach, where each study containing cases and controls was normalized and analyzed individually to identify differentially expressed genes and enriched pathways and then these results were compared across studies. Information available in literature was used to validate some of the resulting observations. The study has helped to generate novel mechanistic hypotheses for lean and obese phenotypes of PCOS and also to validate existing observations such as higher comorbidity in women who are obese and suffer from PCOS as compared to lean PCOS [73][19]. 2. Methods The workflow adopted in this study is illustrated in [74]Fig. 1. Fig. 1. [75]Fig. 1 [76]Open in a new tab Summary of the workflow adopted in this study. 2.1. Data collection and microarray gene expression The Gene Expression Omnibus (GEO) database was searched on 30th September 2019 to retrieve human microarray gene expression studies on PCOS using the query terms (((“polycystic ovary syndrome”) AND “Homo sapiens”[Organism]) AND gse[Filter]) AND “Expression profiling by array” [Filter]. 26 datasets were identified using this query, which were further manually curated for excluding studies that involved lncRNA, drug-treated samples, cell line studies and non-BMI matched samples. Eight GEO datasets ([77]GSE98421, [78]GSE5850, [79]GSE10946, [80]GSE6798, [81]GSE98595, [82]GSE48301, [83]GSE5090 and [84]GSE43264) qualified for the meta-analysis ([85]Table 1, Supplementary Table S1 and [86]Fig. 1). The women in these studies were classified as lean/non-obese (BMI ≤ 23) or obese (BMI > 23) based on their body mass index (BMI). In case of [87]GSE98421, the BMI of the women are not provided in GEO; however the study states that the tissue samples are from lean PCOS patients and hence this study was categorized under lean PCOS. Table 1. Details of the GEO datasets used in the study. Category GSE Ids Cell / tissue type Sample size __________________________________________________________________ BMI __________________________________________________________________ Platform Ids Platforms Age (Years) No. of Probes/genes Control PCOS Control PCOS Lean/ Non-obese [88]GSE98595 Lutein granulosa cells 3 3 22.04 22.39 [89]GPL6244 Affymetrix Human Gene 1.0 ST Array Control: Average- 27.33, PCOS: Average- 28 33,297 [90]GSE98421 Subcutaneous adipose tissue 4 4 – – [91]GPL570 Affymetrix Human Genome U133 Plus 2.0 Array – 54,675 [92]GSE10946 Cumulus cells 6 5 20 ± 4 22 ± 3 [93]GPL570 Affymetrix Human Genome U133 Plus 2.0 Array Average: 31 (range 29–32) 54,675 Obese [94]GSE10946 Cumulus cells 5 7 31 ± 4.5 32 ± 4.5 [95]GPL570 Affymetrix Human Genome U133 Plus 2.0 Array Average: 31 (range 29–32) 54,675 [96]GSE6798 Skeletal muscle 13 16 34.0 ± 1.8 34.1 ± 1.1 [97]GPL570 Affymetrix Human Genome U133 Plus 2.0 Array Control: 34.7 ± 2.0, PCOS: 30.8 ± 1.8 54,675 [98]GSE5850 Metaphase II oocyte 6 6 23.64 ± 1.33 30.98 ± 3.80 [99]GPL570 Affymetrix Human Genome U133 Plus 2.0 Array Control: 31.30 ± 0.79, PCOS: 30.56 ± 1.41 54,675 [100]GSE43264 Subcutaneous adipose tissue 7 8 38.25 38.14 [101]GPL15362 NuGO array (human) NuGO_Hs1a520180 – 17,126 [102]GSE48301 Endometrial epithelial cells (eEP), endothelial cells (eEN), stromal fibroblasts (eSF) and mesenchymal stem cells (eMSC) 15 14 35.73 ± 3.96 ≥ 27 [103]GPL6244 Affymetrix Human Gene 1.0 ST Array Control: 36.50 ± 1.70, PCOS: 30.5 ± 2.1 33,297 [104]GSE5090 Omental adipose tissue 8 9 52.8 + 5 51.0 + 10.2 [105]GPL96 Affymetrix Human Genome U133A Array Control: 40.4 ± 3.6, PCOS: 31.6 ± 7.9 22,283 [106]Open in a new tab 2.2. Microarray data pre-processing and identification of differentially expressed genes (DEGs) The CEL files were retrieved for the selected GEO datasets and each dataset was analyzed following a meta-analysis approach, where each case-control study was separately analyzed from raw data to the differentially expressed gene stage, then these DEG lists were compared across studies [107][30]. In particular, the raw data available in each of the CEL files of the selected GEO datasets were background corrected and quantile normalized. Probe sets were summarized using the Robust Multi-Array Average (RMA) algorithm implemented in the affy [108][31] and oligo [109][32] R packages. Relevant and updated annotations were retrieved for the probesets using the biomaRt [110][33] R Package. Differential gene expression was calculated using the limma [111][34] R package. Statistically significant DEGs were determined based on p-value (p < 0.05) and log fold change (logFC > 2 for upregulated and < -0.5 for downregulated genes). The DEGs were identified with reference to PCOS cases versus controls for each of the GEO datasets. 2.3. Analysis using DEGs The DEGs obtained from each dataset were compared to detect commonly dysregulated genes in PCOS across diverse sample types and GEO platforms. For tissue-based analysis, the aforementioned list of DEGs were clustered based on their tissue source to identify commonly dysregulated genes across the tissue types. The array expression datasets were grouped based on their source, into four tissue types, namely ovarian, endometrial, adipose and skeletal. The ovarian group included metaphase II oocyte, cumulus cells and lutein granulosa cells. The endometrial group included cell types of epithelial, endothelial, stromal fibroblasts and mesenchymal stem cells. The adipose group had subcutaneous adipose tissue and omental adipose tissue. The skeletal group had skeletal muscle tissue. 2.4. Pathway enrichment analysis Pathway enrichment analysis was performed using the Gene Set Enrichment Analysis (GSEA) method. The Java desktop application of GSEA (v. 3.0) developed by Broad Institute was used to identify statistically enriched gene sets and pathways in each of the datasets [112][35]. The Bader lab human gene set database containing updated information collected from various pathway databases such as GO [113][36], Reactome [114][37], KEGG [115][38] and MsigDB [116][39], excluding annotations that have evidence code IEA (inferred from electronic annotation), ND (no biological data available) and RCA (inferred from reviewed computational analysis) were used for GSEA analysis [117][40]. The “Max” size was set at 200 and “Min” size was set at 10 in order to remove the “too general” and “too specific” gene sets and pathways, respectively. The number of permutations was set to 2000. The analysis was performed using the weighted enrichment statistic, using the default weight set to p = 1. The GSEA output and normalized expression data were used to perform Enrichment analysis. The Enrichment Map Analysis Pipeline [118][41] in Cytoscape version 3.6.1 [119][42] was used to visualize the pathway enrichment analysis results. All the parameters were set to their defaults. FDR q-value and p-value cutoff were set at 0.1 and 1.0, respectively. For all datasets of [120]GPL570, individual networks were created. A master network was created using all tissue types of [121]GPL570 only, except metaphase II oocyte (as oocyte may have distinct cellular events and metabolic pathways) for identifying the common and unique enriched pathways in lean and obese PCOS. The AutoAnnotate [122][43] Cytoscape app was used to identify clusters present in the enrichment map for grouping redundant pathways and ease of interpretation. 2.5. Comorbidities and disease distribution in lean and obese PCOS The KEGG disease database (Release 88.2) was used to obtain information for human diseases and its associated genes. The diseases were categorized as per International Classification of Diseases 11th Revision (ICD-11). The KEGG gene and disease IDs were used for mapping genes with the highest level of ICD11 classification. The DEGs were further mapped to this data to obtain the gene-disease association score (GDS) for PCOS. For each ICD-11 category, GDS was calculated as below [MATH: GDS=NumberofDEGsfrommetaanalysismappedtothediseaseTotalnumberofgenesmappedtothedisease×100< /mrow> :MATH] Gene-disease association was further used to construct gene-disease networks for lean and obese PCOS. 2.6. Gene set variation analysis (GSVA) The R packages GSVA [123][44] and complexheatmap [124][45] were used to generate a heatmap displaying the variation of gene sets in different tissues of women with PCOS. The gene expression matrix (logFC values) was analyzed by GSVA using gene set wherein each gene set contains a list of genes associated with diseases classified by ICD-11 codes from the KEGG database. GSVA was performed to understand the regulation of genes across the ovarian tissue types for lean and obese PCOS ([125]GPL570 platform only) and its impact on reproductive and endocrine diseases. 3. Results 3.1. The majority of differentially expressed genes in PCOS are downregulated A total of 5014 (unique = 4224) statistically significant DEGs were identified by analyzing the eight GEO datasets individually (see Method section 2.2, Supplementary Table S2). Of these, 123 (unique = 96) genes were upregulated and 4891 (unique = 4101) genes were downregulated (Supplementary Table S3). Regardless of tissue type and phenotype, the majority of the genes were downregulated in PCOS. Endometrial cells of obese PCOS and cumulus cells of lean PCOS displayed the highest number of dysregulated genes ([126]Fig. 2). Seven (ETV3, GABPB1, ELF3, GABPA, ELF1, ELF4 and SRF) genes were identified to encode transcription factors using the iRegulon Cytoscape app [127][46] (Supplementary Table S4). Of the 4224 unique DEGs, the association of 136 genes with PCOS has been established in the literature. The links to the relevant publications can be viewed under the “Literature Evidence” column in Supplementary Table S2. Fig. 2. [128]Fig. 2 [129]Open in a new tab DEGs across all tissue types. For each dataset (x-axis), DEGs were identified based on p-value <0.05 and logFC >2 for upregulation (red bars) and <−0.5 for downregulation (blue bars). (For interpretation of the references to colour in this figure legend, the reader is referred